THERMODYNAMIC PROPENSITIES OF AMINO ACIDS IN THE NATIVE STATE
ENSEMBLE: IMPLICATIONS FOR FOLD RECOGNITION 



[0001] This Application claims priority to U.S. Provisional Application No.
60/261,733, which was filed on January 16, 2001. 

[0002] The work herein was supported by grants from the United States
Government. The United States Government may have certain rights in the invention. 

BACKGROUND OF THE INVENTION 

I. Field of the Invention

[0003] The present invention relates to the field of structural biology. More
particularly, the present invention relates to a protein database and methods of developing a
protein database that contains all of the thermodynamic information necessary to encode a
three-dimensional protein structure.

II. Related Art

[0004] It is a longstanding idea that protein structures are the result of an 
amino acid chain finding its global free energy minimum in the solvent environment
(Anfinsen, 1973). Several exceptions to this so-called "thermodynamic control" have been
discovered in recent years, including examples of proteins whose folding may be under
"kinetic control" (Baker et al., 1992; Cohen, 1999) and proteins requiring information not
completely contained in the amino acid sequence (e.g., chaperone-assisted folding (Feldman
& Frydman, 2000; Fink, 1999)). Although thermodynamic control is widely accepted as the
default behavior for correct folding (Jackson, 1998), a detailed understanding of the forces 
involved in thermodynamic control and how atomic interactions relate amino acid sequence 
to the folding and stability of the native structure has still proven elusive. 

[0005] Despite the progress that has been made in protein folding, obstacles
have prevented the development of an accurate structure prediction algorithm. One such
obstacle has been the lack of suitable potentials for calculating the free energies of
different conformations of a given protein molecule. In 1992, high-pressure liquid
chromatography (HPLC) was used to quantitate the energies of pairwise interactions between
amino acid side chains (Pochapsky and Gopen, 1992). Yet further, in 1999, Pochapsky used
HPLC to further study the thermodynamic interactions between amino acid side chains. A
stationary phase was prepared for use in HPLC by derivatizing microparticulate silica gels
with functionality mimicking the side chains of hydrophobic and amphiphilic amino acid
analytes (Pereira de Araujo et al., 1999). Thus, this variation of the HPLC method compares
entropies and free energies of interaction using different derivatized microparticulate
silica gels.

[0006] The present invention uses a computer-based algorithm to address for 
the first time whether amino acid residue types have distinct preferences for thermodynamic 
environments in the folded native structure of a protein, and whether a scoring matrix based 
solely on thermodynamic information (independent of explicit structural constraints) can be 
used to identify correct sequences that correspond to a particular target fold. This is done by
means of a unique approach in which the regional stability differences within a protein are
determined for a database of proteins using the COREX algorithm (Hilser & Freire, 1996).
The COREX algorithm generates an ensemble of states using the high-resolution structure as 
a template. Based on the relative probability of the different states in the ensemble, different 
regions of the protein are found to be more stable than others. Thus, the COREX algorithm 
provides access to residue-specific free energies of folding. 

BRIEF SUMMARY OF THE INVENTION 

[0007] One embodiment of the present invention is directed to a system and 
method of developing a protein database that contains all of the thermodynamic information 
necessary to encode a three-dimensional protein structure.

[0008] Another embodiment of the present invention comprises a protein 
database comprising nonhomologous proteins having known residue-specific free energies of
folding of the proteins. In specific embodiments, the database comprises globular proteins. 

[0009] In further embodiments, the database is determined by a computational
method comprising the step of determining a stability constant from the ratio of the summed 
probability of all states in the ensemble in which a residue j is in a folded conformation to the 
summed probability of all states in which j is in an unfolded conformation according to the 
equation. 

[0010] Another specific embodiment of the present invention provides that
the stability constants for the residues are arranged into at least one of the three
thermodynamic classification groups selected from the group consisting of stability, enthalpy,
and entropy.

[0011] In specific embodiments, the stability thermodynamic classification
group comprises high stability, medium stability and low stability. More particularly, the
residues in the high stability classification comprise phenylalanine, tryptophan and tyrosine.
The residues in the low stability classification comprise glycine and proline. And the
residues in the medium stability classification comprise asparagine and glutamic acid.

[0012] Yet further, the enthalpy thermodynamic classification group 
comprises high enthalpy and low enthalpy. Enthalpy comprises a ratio of the contributions of 
polar and apolar components. 

[0013] In another specific embodiment, the entropy thermodynamic 
classification group comprises high entropy and low entropy. Entropy comprises a ratio of 
the contributions of polar and apolar components. 

[0014] In a further embodiment, the stability constants for the residues are 
arranged into twelve thermodynamic classifications selected from the group consisting of
HHH, MHH, LHH, HHL, MHL, LHL, HLL, MLL, LLL, HLH, MLH and LLH. 
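
The twelve classification codes listed above follow from combining three stability levels with two enthalpy levels and two entropy levels. A minimal sketch of that enumeration, assuming the letter order stability, enthalpy, entropy (the function name `environment_codes` is illustrative only):

```python
from itertools import product

# Letter order assumed here: stability, then enthalpy, then entropy.
STABILITY_LEVELS = "HML"  # high, medium, low stability
ENTHALPY_LEVELS = "HL"    # high, low enthalpy ratio
ENTROPY_LEVELS = "HL"     # high, low entropy ratio


def environment_codes():
    """Return all three-letter thermodynamic classification codes."""
    return ["".join(combo)
            for combo in product(STABILITY_LEVELS, ENTHALPY_LEVELS, ENTROPY_LEVELS)]


codes = environment_codes()
print(len(codes), codes)
```

Three stability levels times two enthalpy levels times two entropy levels gives the twelve classifications.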

[0015] Another embodiment of the present invention is a method of developing 
a protein database comprising the steps of: inputting high resolution structures of proteins; 
generating an ensemble of incrementally different conformational states by combinatorial 
unfolding of a set of predefined folding units in all possible combinations of each protein; 
determining the probability of each said conformational state; calculating a residue-specific 
free energy of each said conformational state; and classifying a stability constant into at least
one thermodynamic classification group selected from the group consisting of stability, 
enthalpy, and entropy. Specifically, the protein database comprises globular and 
nonhomologous proteins. 

[0016] In specific embodiments, the generating step comprises dividing the
proteins into folding units by placing a block of windows over the entire sequence of the 
protein and sliding the block of windows one residue at a time. 
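
The sliding-window partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the COREX implementation: the function names, the half-open index convention, and the handling of short fragments at the chain ends are all hypothetical choices made for the example.

```python
def partition(n_residues, window, offset):
    """Split residues 0..n_residues-1 into folding units of length `window`.

    `offset` shifts the block of windows along the sequence; leftover residues
    at either end become shorter units (an assumption of this sketch).
    Units are returned as half-open (start, end) index ranges.
    """
    bounds = list(range(offset, n_residues, window))
    if not bounds or bounds[0] != 0:
        bounds.insert(0, 0)          # make sure the first unit starts at residue 0
    bounds.append(n_residues)        # close the last unit at the chain end
    return list(zip(bounds, bounds[1:]))


def all_partitions(n_residues, window):
    """Slide the block of windows one residue at a time: one partition per offset."""
    return [partition(n_residues, window, offset) for offset in range(window)]


parts = all_partitions(12, 5)
for units in parts:
    print(units)
```

Each partition covers the whole chain, and each offset yields a different set of folding-unit boundaries, so a window of five residues gives five partitions.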

[0017] In a further specific embodiment, the determining step comprises
determining the free energy of each of the conformational states in the ensemble; determining
the Boltzmann weight [K_i = exp(-ΔG_i/RT)] of each state; and determining the probability of
each state using the equation:
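
A minimal numerical sketch of this step, assuming the usual normalization in which the probability of a state is its Boltzmann weight divided by the sum of the weights of all states. The gas-constant units (kcal/(mol·K)), the default temperature, and the example free energies are assumptions made for illustration.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K); unit choice is an assumption


def state_probabilities(delta_g, temperature=298.15):
    """Convert state free energies (relative to the reference state) to probabilities.

    Boltzmann weight: K_i = exp(-dG_i / RT); probability: P_i = K_i / sum(K).
    """
    weights = [math.exp(-dg / (R * temperature)) for dg in delta_g]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical free energies (kcal/mol) for three states; the first is the reference.
probs = state_probabilities([0.0, 1.5, 3.0])
print(probs)
```

States with higher free energy relative to the reference receive exponentially smaller probabilities, and the probabilities sum to one over the ensemble.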

[0018] In specific embodiments, the calculating step comprises determining the 
energy difference between all microscopic states in which a particular residue is folded and
all such states in which it is unfolded using the equation 
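
The residue-specific calculation can be sketched as below, assuming each state is represented by its probability and a per-residue folded/unfolded mask (a hypothetical representation), and that the residue free energy is related to the stability constant by ΔG = -RT ln κ. The three-state ensemble is invented for illustration.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K); unit choice is an assumption


def residue_stability(states, n_residues, temperature=298.15):
    """Return (ln_kf, delta_g) lists, one entry per residue.

    `states` is a list of (probability, folded_mask) pairs, where
    folded_mask[j] is True if residue j is folded in that state.
    Assumes every residue is unfolded in at least one state.
    """
    ln_kf, delta_g = [], []
    for j in range(n_residues):
        p_folded = sum(p for p, mask in states if mask[j])
        p_unfolded = sum(p for p, mask in states if not mask[j])
        value = math.log(p_folded / p_unfolded)
        ln_kf.append(value)
        delta_g.append(-R * temperature * value)
    return ln_kf, delta_g


# Hypothetical three-residue ensemble: native, partially unfolded, fully unfolded.
example_states = [
    (0.7, [True, True, True]),
    (0.2, [True, True, False]),
    (0.1, [False, False, False]),
]
ln_kf, delta_g = residue_stability(example_states, 3)
print(ln_kf)
```

In this toy ensemble the third residue is unfolded in more of the populated states than the first two, so its stability constant is lower.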

[0019] Another embodiment of the present invention is a method of identifying 
a protein fold comprising determining the distribution of amino acid residues in different 
thermodynamic environments corresponding to a known protein structure. Specifically, 
determining the distribution of amino acid residues comprises constructing scoring matrices 
derived from thermodynamic information. The scoring matrices are derived from COREX
thermodynamic information selected from the group consisting of stability, enthalpy, and 
entropy. 
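
One way such a scoring matrix could be built is sketched below. The log-odds form used here (the log of the frequency of an amino acid in an environment relative to its overall frequency) is the form commonly used for 3D-1D profiles and is an assumption; this section does not give the formula. The pseudocount, the function name, and the toy observations are likewise hypothetical.

```python
import math
from collections import Counter


def scoring_matrix(observations, pseudocount=1.0):
    """Build log-odds scores from (amino_acid, environment) observations.

    score(aa, env) = ln( P(aa | env) / P(aa) ), with a pseudocount so that
    unobserved pairs do not produce log(0).
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    env_counts = Counter(env for _, env in observations)
    total = len(observations)
    n_aa = len(aa_counts)

    scores = {}
    for aa in aa_counts:
        p_aa = aa_counts[aa] / total
        for env in env_counts:
            p_aa_given_env = ((pair_counts[(aa, env)] + pseudocount)
                              / (env_counts[env] + pseudocount * n_aa))
            scores[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return scores


# Toy data: F is mostly seen in environment "H", G mostly in environment "L".
toy = [("F", "H")] * 8 + [("F", "L")] * 2 + [("G", "H")] * 2 + [("G", "L")] * 8
scores = scoring_matrix(toy)
print(scores[("F", "H")], scores[("G", "H")])
```

A positive score marks an amino acid that is enriched in an environment, and a negative score marks one that is depleted.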

[0020] The aforementioned embodiments of the present invention may be 
readily implemented as a computer-based system. One embodiment of such a computer- 
based system includes a computer program that receives an input of high resolution structure 
data for one or more proteins. The computer-based program utilizes this data to determine 
the amino acid thermodynamic classifications for the proteins. These amino acid 
thermodynamic classifications may then be stored in a database. The database of the system 
preferably has a data structure with a field or fields for storing a value for an amino acid 
name or amino acid abbreviation, and one or more classification fields for storing a numerical 
value for a thermodynamic classification for a particular amino acid. Additionally, this data 
structure may have a field for storing a value representing the summed total of each of the
numerical values for each thermodynamic classification for a particular amino acid. 
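
A minimal sketch of a record with the fields described above. The class name, field names, and example counts are hypothetical and chosen only to illustrate the layout (a name field, an abbreviation field, one numerical value per thermodynamic classification, and a summed total).

```python
from dataclasses import dataclass, field


@dataclass
class AminoAcidRecord:
    """One database row: an amino acid and its per-classification values."""
    name: str
    abbreviation: str
    # Maps a classification label (e.g. "HHH") to a numerical value.
    classification_values: dict = field(default_factory=dict)

    @property
    def total(self):
        """Summed total of the numerical values over all classifications."""
        return sum(self.classification_values.values())


# Illustrative values only; they are not data from the described database.
record = AminoAcidRecord("phenylalanine", "Phe", {"HHH": 3, "HLL": 5})
print(record.abbreviation, record.total)
```

Deriving the total from the stored values, rather than storing it separately, keeps it consistent when a classification value is updated.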

[0021] In one embodiment of the inventive system, the computer-based 
program performs a process to generate thermodynamic classifications for a protein which 
includes inputting high resolution structures of proteins, generating an ensemble of 
incrementally different conformational states by combinatorial unfolding of a set of 
predefined folding units in all possible combinations of each protein, determining the
probability of each said conformational state, calculating a residue-specific free energy of
each said conformational state, and classifying a stability constant into a thermodynamic 
classification group. Additionally, the computer-based program may have a probability 
determination module to determine the free energy of each of the conformational states in a 
computed ensemble, determine a Boltzmann weight, and then determine the probability of 
each state. 

[0022] Moreover, the computer-based program of the inventive system may 
have a display/reporting module for producing one or more graphical reports to a screen or a 
print-out. Some of these reports include: a display of a three-dimensional protein structure 
based on said amino acid thermodynamic classifications; a scatter-plot of normalized 
frequencies of COREX stability data versus normalized frequencies of average side chain 
surface exposure; and a chart displaying thermodynamic environments for amino acids of a 
protein. 

[0023] Another aspect of the inventive methods is that they may be stored as
computer-executable instructions on a computer-readable medium.

[0024] The foregoing has outlined rather broadly the features and technical 
advantages of the present invention in order that the detailed description of the invention that 
follows may be better understood. Additional features and advantages of the invention will 
be described hereinafter which form the subject of the claims of the invention. It should be
appreciated by those skilled in the art that the conception and specific embodiment disclosed
may be readily utilized as a basis for modifying or designing other structures for carrying out
the same purposes of the present invention. It should also be realized by those skilled in the
art that such equivalent constructions do not depart from the spirit and scope of the invention
as set forth in the appended claims. The novel features which are believed to be
characteristic of the invention, both as to its organization and method of operation, together
with further objects and advantages will be better understood from the following description
when considered in connection with the accompanying figures. It is to be expressly
understood, however, that each of the figures is provided for the purpose of illustration and
description only and is not intended as a definition of the limits of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0025] The following drawings form part of the present specification and are 
included to further demonstrate certain aspects of the present invention. The invention may 
be better understood by reference to one or more of these drawings in combination with the
detailed description of specific embodiments presented herein. 

[0026] Figure 1A and Figure 1B are a schematic description of the COREX
algorithm applied to the crystal structure of the ovomucoid third domain, OM3 (2ovo).
Figure 1A summarizes the partitioning strategy of the COREX algorithm. Figure 1B
illustrates the solvent exposed surface area (ASA) contributing to the energetics of microstate
32.

[0027] Figure 2 is a comparison of hydrogen exchange protection factors 
predicted from COREX data with experimental values for ovomucoid third domain (2ovo).
Unfilled vertical bars denote predicted values, and filled vertical bars denote experimental
values (Swint-Kruse & Robertson, 1996). The solid line denotes ln κf values. The simulated
temperature of the COREX calculation was set at 30 °C to match the experimental conditions.
Secondary structure is given by labeled horizontal lines. Asterisks show the positions of Thr 
47 and Thr 49, referred to in the text. 

[0028] Figure 3A, Figure 3B, Figure 3C, Figure 3D, Figure 3E, Figure 3F, 
Figure 3G, Figure 3H, Figure 3I, Figure 3J, Figure 3K, Figure 3L, Figure 3M, Figure 3N,
Figure 3O, Figure 3P, Figure 3Q, Figure 3R, Figure 3S and Figure 3T comprise
normalized frequencies of COREX stability data as a function of amino acid type. Figure 3A
shows the data as a function of the amino acid alanine. Figure 3B shows the data as a 
function of the amino acid arginine. Figure 3C shows the data as a function of the amino acid 
asparagine. Figure 3D shows the data as a function of the amino acid aspartic acid. Figure 
3E shows the data as a function of the amino acid cysteine. Figure 3F shows the data as a 
function of the amino acid glutamine. Figure 3G shows the data as a function of the amino 
acid glutamic acid. Figure 3H shows the data as a function of the amino acid glycine. Figure
3I shows the data as a function of the amino acid histidine. Figure 3J shows the data as a
function of the amino acid isoleucine. Figure 3K shows the data as a function of the amino
acid leucine. Figure 3L shows the data as a function of the amino acid lysine. Figure 3M 
shows the data as a function of the amino acid methionine. Figure 3N shows the data as a 
function of the amino acid phenylalanine. Figure 3O shows the data as a function of the
amino acid proline. Figure 3P shows the data as a function of the amino acid serine. Figure 
3Q shows the data as a function of the amino acid threonine. Figure 3R shows the data as a 
function of the amino acid tryptophan. Figure 3S shows the data as a function of the amino
acid tyrosine. Figure 3T shows the data as a function of the amino acid valine. In each 
histogram, the low stability bin is on the left, the medium stability bin is in the middle, and 
the high stability bin is on the right. The data used in each histogram was taken from the 
2922 residue data set, as given in Table 2. 

[0029] Figure 4 is a scatterplot of normalized frequencies of COREX stability 
data versus normalized frequencies of average side chain surface area exposure. Average 
side chain exposure in the native structure was calculated by using a moving window of five 
residues, similar to the basis of the COREX algorithm. These values were then binned into 
high, medium, and low surface area exposure. 

[0030] Figure 5A, Figure 5B, Figure 5C and Figure 5D illustrate a summary of 
fold-recognition results for COREX stability and DSSP secondary structure scoring matrices 
for 44 targets. Black bars denote real data (either ln κf or secondary structure), and striped
bars denote the average of three random data sets. Figure 5A shows the ln κf scoring matrix
local alignment algorithm. Figure 5B shows the ln κf scoring matrix global alignment
algorithm. Figure 5C shows the secondary structure scoring matrix local alignment
algorithm. Figure 5D shows the secondary structure scoring matrix global alignment 
algorithm. 

[0031] Figure 6A, Figure 6B and Figure 6C illustrate examples of successful 
local alignment for three targets. Results for target 1igd (Protein G) are shown in Figure 6A,
results for target 1vcc (DNA topoisomerase I) are shown in Figure 6B, and results for target
2ait (tendamistat) are shown in Figure 6C. The thin black line represents COREX calculated
stability data (ln κf) for the protein target. The filled circles connected by a thick black line
correspond to the cumulative matrix score contributed by each residue. Scores that did not
contribute to the final score due to the rules of the local alignment algorithm (Smith &
Waterman, 1981) are shown as unfilled circles connected by a thick dashed line. 

[0032] Figure 7 is a correlation between stability data derived from the database 
of 44 proteins used in this work and stability data derived from an independent database of 50
proteins. Data on the x-axis are taken from the normalized histograms in Figure 3A-Figure 
3T. Data on the y-axis are derived from an identical COREX analysis of an independent 
database of 3304 residues from 50 PDB structures not contained in the original database. 
Open circles denote the values for His, a residue type with low statistics in both databases. 
The dashed line represents a perfect correlation. 

[0033] Figure 8A and Figure 8B illustrate the results of a COREX calculation
for the bacterial cold-shock protein cspA (PDB 1mjc). Figure 8A shows a plot of calculated
thermodynamic stability, ln κf,j, as a function of residue number for cspA. The simulated
temperature was 25.0 °C. Regions of relatively high, medium, and low stability are shown in
dark gray, light gray, and black, respectively. Secondary structure elements, as defined by
the program DSSP (Kabsch and Sander, 1983), are labeled. Figure 8B locates the relative
calculated stabilities of each residue in the 1mjc crystal structure. Note that a given
secondary structural element is predicted to have varying regions of stability, and that the 
most stable regions of the molecule are often, but not necessarily, within the hydrophobic 
core. 

[0034] Figure 9A, Figure 9B and Figure 9C illustrate a description of protein 
structure in terms of thermodynamic environments. Figure 9A shows the thermodynamic 
environment classification scheme used herein. Three quantities derived from the output of 
the COREX algorithm, stability (κf,j), enthalpy ratio (Hratio,j), and entropy ratio (Sratio,j),
describe the thermodynamic environment of each residue. Figure 9B shows the 12 
thermodynamic environments defined by this classification scheme in a schematic describing 
protein energetic phase space. Each cube represents a region dominated by certain stability, 
enthalpy, and entropy characteristics. Every residue position in the protein structures used 
herein lies somewhere within this phase space. Figure 9C shows examples of the distribution 
of the thermodynamic environments of Figure 9B in three proteins with varying types and
amounts of secondary structure. Note that single secondary structure elements do not exhibit 
unique thermodynamic environments. 

[0035] Figure 10A, Figure 10B, Figure 10C, Figure 10D, Figure 10E, Figure
10F, Figure 10G, Figure 10H, Figure 10I, Figure 10J, Figure 10K and Figure 10L show 3D-1D
scores relating amino acid types to 12 protein structural thermodynamic environments.
The three-letter abbreviation in each panel represents the stability, enthalpic, and entropic
descriptor of the thermodynamic environment. Stability is classified into high, medium and
low. Entropy and enthalpy are classified into high and low. Figure 10A represents LHH,
which is a protein thermodynamic environment of low stability, high polar/apolar enthalpy
ratio, and high conformational entropy/Gibbs' solvation energy ratio. Figure 10B represents
LHL, which is a protein thermodynamic environment of low stability, high polar/apolar
enthalpy ratio, and low conformational entropy/Gibbs' solvation energy ratio. Figure 10C
represents LLH, which is a protein thermodynamic environment of low stability, low
polar/apolar enthalpy ratio, and high conformational entropy/Gibbs' solvation energy ratio.
Figure 10D represents LLL, which is a protein thermodynamic environment of low stability,
low polar/apolar enthalpy ratio, and low conformational entropy/Gibbs' solvation energy
ratio. Figure 10E represents MHH, which is a protein thermodynamic environment of
medium stability, high polar/apolar enthalpy ratio, and high conformational entropy/Gibbs'
solvation energy ratio. Figure 10F represents MHL, which is a protein thermodynamic
environment of medium stability, high polar/apolar enthalpy ratio, and low conformational
entropy/Gibbs' solvation energy ratio. Figure 10G represents MLH, which is a protein
thermodynamic environment of medium stability, low polar/apolar enthalpy ratio, and high
conformational entropy/Gibbs' solvation energy ratio. Figure 10H represents MLL, which is
a protein thermodynamic environment of medium stability, low polar/apolar enthalpy ratio,
and low conformational entropy/Gibbs' solvation energy ratio. Figure 10I represents HHH,
which is a protein thermodynamic environment of high stability, high polar/apolar enthalpy
ratio, and high conformational entropy/Gibbs' solvation energy ratio. Figure 10J represents
HHL, which is a protein thermodynamic environment of high stability, high polar/apolar
enthalpy ratio, and low conformational entropy/Gibbs' solvation energy ratio. Figure 10K
represents HLH, which is a protein thermodynamic environment of high stability, low
polar/apolar enthalpy ratio, and high conformational entropy/Gibbs' solvation energy ratio.
Figure 10L represents HLL, which is a protein thermodynamic environment of high stability,
low polar/apolar enthalpy ratio, and low conformational entropy/Gibbs' solvation energy
ratio.






[0036] Figure 11 shows fold-recognition results for 81 protein targets using a 
scoring matrix composed of thermodynamic information from protein structures. The 
horizontal axis represents the percentile ranking of the score against the target structure for 
the sequence corresponding to the target structure. For example, the sequence corresponding 
to the target cold-shock protein (PDB 1mjc) received the 157th highest score of 3858
sequences against the cold-shock protein thermodynamic profile. This result placed the
sequence for the cold-shock protein in the 5th percentile bin in Figure 11. When aligned with
their respective thermodynamic profiles, the majority (44/81) of sequences scored better than 
99% of the 3858 sequences in the database. 
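
The percentile arithmetic in the cold-shock protein example can be checked with a short sketch. The bin edges below are an assumption made for illustration (the text names only the 5th percentile bin and the 99% threshold); the function names are hypothetical.

```python
def percentile_rank(rank, n_sequences):
    """Percentage of the database at or above this rank (rank 1 = best score)."""
    return 100.0 * rank / n_sequences


def percentile_bin(percentile, edges=(1, 5, 10, 25, 50, 100)):
    """Return the upper edge of the first bin containing `percentile`.

    The edges are hypothetical; the source does not list the bins used.
    """
    for edge in edges:
        if percentile <= edge:
            return edge
    return edges[-1]


pct = percentile_rank(157, 3858)
print(round(pct, 2), percentile_bin(pct))
```

A rank of 157 out of 3858 corresponds to roughly the top 4.07% of sequences, which is consistent with placement in the 5th percentile bin.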

[0037] Figure 12 shows fold-recognition results for 12 all-beta protein targets 
using a scoring matrix composed of thermodynamic information from 31 all-alpha protein 
structures. The horizontal axis represents the percentile ranking of the score against the
target structure for the sequence corresponding to the target structure. For example, the
sequence corresponding to the all-beta target tendamistat (PDB 1hoe) received the 26th
highest score of 3858 sequences against the tendamistat thermodynamic profile. This result
placed the tendamistat sequence in the 5th percentile bin in Figure 5. All 12 sequences
corresponding to beta targets scored better against their respective targets than 90% of the 
3858 sequences in the database. 

DETAILED DESCRIPTION OF THE INVENTION 

[0038] It is readily apparent to one skilled in the art that various embodiments 
and modifications may be made to the invention disclosed in this Application without 
departing from the scope and spirit of the invention. 

[0039] As used herein in the specification, "a" or "an" may mean one or more. As
used herein in the claim(s), when used in conjunction with the word "comprising", the words 
"a" or "an" may mean one or more than one. As used herein "another" may mean at least a 
second or more. 

[0040] The term "conformation" as used herein refers to various
nonsuperimposable three-dimensional arrangements of atoms that are interconvertible 
without breaking covalent bonds. 

[0041] The term "configuration" as used herein refers to different
conformations of a protein molecule that have the same chirality of atoms.

[0042] The term "database" as used herein refers to a collection of data 
arranged for ease of retrieval by a computer. Data is also stored in a manner where it is easily
compared to existing data sets. 

[0043] The term "enthalpy" as used herein refers to a thermodynamic state or
environment in which the enthalpy of internal interactions and the hydrophobic entropy
change favor protein folding; thus enthalpy is a thermodynamic component in the
thermodynamic stability of globular proteins. Enthalpy is a ratio of polar and apolar
contributions ($H_{ratio,j} = \Delta H_{pol,j} / \Delta H_{apol,j}$).

[0044] The term "entropy" as used herein refers to a thermodynamic state or
environment in which the conformational entropy change works against folding of proteins.
Entropy is a ratio of the conformational entropy to the total solvation free energy.

[0045] The term "globular protein" as used herein refers to proteins whose
polypeptide chains are folded into compact structures. The compact structures are
unlike the extended filamentous forms of fibrous proteins. A skilled artisan realizes that
globular proteins have tertiary structures which comprise the secondary structure elements,
e.g., helices, β sheets, or nonregular regions folded in specific arrangements. An example of
a globular protein includes, but is not limited to, myoglobin.

[0046] The term "peptide" as used herein refers to a chain of amino acids with a 
defined sequence whose physical properties are those expected from the sum of its amino
acid residues and which has no fixed three-dimensional structure.

[0047] The term "polyamino acids" as used herein refers to random sequences 
of varying lengths generally resulting from nonspecific polymerization of one or more amino 
acids. 

[0048] The term "protein" as used herein refers to a chain of amino acids
usually of defined sequence and length and three-dimensional structure. The polymerization
reaction, which produces a protein, results in the loss of one molecule of water from each
amino acid; proteins are therefore often said to be composed of amino acid residues. Natural protein
molecules may contain as many as 20 different types of amino acid residues, each of which 
contains a distinctive side chain. 

[0049] The term "protein fold" as used herein refers to an organization of a 
protein to form a structure which constrains individual amino acids to a specific location 
relative to the other amino acids in the sequence. One of skill in the art realizes that this type 
of organization of a protein comprises secondary, tertiary and quaternary structures.

[0050] The term "thermodynamic environment" as used herein refers to the 
various thermodynamic components that contribute to the folding process of a protein. For 
example, stability, entropy and enthalpy thermodynamic environments contribute to the 
folding of a protein. One skilled in the art realizes that the terms "thermodynamic
environment", "thermodynamic classification" or "thermodynamic component" are 
interchangeable. 

[0051] There is a hierarchy of protein structure. The primary structure is the 
covalent structure, which comprises the particular sequence of amino acid residues in a 
protein and any posttranslational covalent modifications that may occur. The secondary 
structure is the local conformation of the polypeptide backbone. The helices, sheets, and 
turns of a protein's secondary structure pack together to produce the three-dimensional 
structure of the protein. The three-dimensional structure of many proteins may be 
characterized as having internal surfaces (directed away from the aqueous environment in 
which the protein is normally found) and external surfaces (which are in close proximity to
the aqueous environment). Through the study of many natural proteins, researchers have 
discovered that hydrophobic residues (such as tryptophan, phenylalanine, tyrosine, leucine, 
isoleucine, valine or methionine) are most frequently found on the internal surface of protein 
molecules. In contrast, hydrophilic residues (such as aspartate, asparagine, glutamate,
glutamine, lysine, arginine, histidine, serine, threonine, glycine, and proline) are most
frequently found on the external protein surface. The amino acids alanine, glycine, serine
and threonine are encountered with equal frequency on both the internal and external protein
surfaces. 

[0052] An embodiment of the present invention is a protein database
comprising nonhomologous proteins having known residue-specific free energies of folding 
of the proteins. 

[0053] One of skill in the art is cognizant that the properties of proteins are 
governed by their potential energy surfaces. Proteins exist in a dynamic equilibrium between
a folded, ordered state and an unfolded, disordered state. This equilibrium in part reflects the
interactions between the side chains of amino acid residues, which tend to stabilize the
protein's structure, and, on the other hand, those thermodynamic forces which tend to
promote the randomization of the molecule. 

[0054] The present invention utilizes a computational method comprising the 
step of determining a stability constant from the ratio of the summed probability of all states 
in the ensemble in which a residue j is in a folded conformation to the summed probability of 
all states in which j is in an unfolded conformation according to the equation,
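
Written out from the verbal definition above, the stability constant takes the following form. The symbols are notational assumptions: $\kappa_{f,j}$ for the stability constant of residue j, $P_i$ for the probability of state i, and the two sums running over the states in which residue j is folded and unfolded, respectively.

```latex
\kappa_{f,j} \;=\;
\frac{\displaystyle\sum_{i \,:\, j \text{ folded in state } i} P_i}
     {\displaystyle\sum_{i \,:\, j \text{ unfolded in state } i} P_i}
```

Because every state of the ensemble falls into exactly one of the two sums for a given residue, the numerator and denominator together add up to one.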

[0055] One of skill in the art is cognizant that although the stability constant is
defined for each position, the value obtained at each residue is not the energetic contribution
of that residue. The stability constant is a property of the ensemble as a whole. For each
partially unfolded microstate, the energy difference between it and the fully folded reference
state is determined by the energetic contributions of all amino acids comprising the folding
units that are unfolded in each microstate, plus the energetic contributions associated with
exposing additional (complementary) surface area on the protein (Figure 1B). The stability
constant thus provides the average thermodynamic environment of each residue, wherein 
surface area, polarity, and packing are implicitly considered. Thus, the stability constant 
provides a thermodynamic metric wherein each of these static structural properties is 
weighted according to its energetic impact at each position. 

[0056] The stability constants for the residues are arranged into three 
classifications of stability selected from the group consisting of high, medium and low. 
Specifically, the residues in the high stability classification comprise phenylalanine,
tryptophan and tyrosine. The residues in the low stability classification comprise glycine
and proline. The residues in the medium stability classification comprise asparagine and
glutamic acid.

[0057] In the present invention, the classifications of high, medium and low are 
determined based upon inspection of the ln κf value for each protein in the selected database.
Thus, one of skill in the art is cognizant that these classifications are relative and may vary 
depending upon the proteins that are selected for the database. One of skill in the art 
recognizes that these classifications can be subclassified by a variety of other parameters, for 
example, but not limited to enthalpy and entropy. Thus, any given position in a structure may 
be represented by two or more parameters, for example, but not limited to low stability (InKf) 
and high enthalpy. Yet fiirther, additional parameters can be used to finther divide the 
categories of enthalpy and entropy, for example, but not limited to conformational entropy, 
solvent entropy, polar enthalpy, apolar enthalpy, polar entropy or apolar entropy. Thus, any 
given position in a structure may have a description such as, but not limited to low stability, 
high apolar enthalpy, high polar enthalpy, medium conformational entropy and high apolar 
entropy. One of skill in the art realizes that these classifications allow for better resolution 
and consequently, better performance in identifying the correct protein fold for a given 
protein sequence or a portion of a given protein sequence. Further one of skill in the art is 
cognizant a protein fold refers to the secondary structure of the protein, which includes 
sheets, helices and turns. 

[0058] Another specific embodiment of the present invention comprises that the
stability constants for the residues are arranged into at least one of the three thermodynamic
classification groups selected from the group consisting of stability, enthalpy, and entropy.

[0059] Specific embodiments of the present invention provide that the
database comprises globular and nonhomologous proteins. A skilled artisan is cognizant that
globular proteins are used to study protein folding. It is contemplated that the computational
method of the present invention may be used for a variety of globular proteins including but
not limited to glucocorticoid receptor like DNA binding domain, histone, acyl carrier protein
like, anti LPS factor/RecA domain, lambda repressor like DNA binding domains, EF hand
like, insulin like, bacterial Ig/albumin binding, barrel sandwich hybrid, P-loop containing NTP
hydrolases, RING finger domain C3HC4, crambin like, ribosomal protein L7/12 C-terminal
fragment, cytochrome c, SAM domain like, KH domain, RNA polymerase subunit H,
beta-grasp (ubiquitin-like), rubredoxin like, HiPIP, anaphylotoxins (complement system),
ferredoxin like, OB fold, midkine, HMG box, saposin, HPr proteins, knottins, HIV-1 Nef
protein fragments, thermostable subdomain from chicken villin, S15/NS1 RNA binding
domain, SH3 like barrel, DNA topoisomerase I domain, IL8 like, de novo designed single
chain 3 helix bundle, alpha amylase inhibitor tendamistat, CI2 family of serine protease
inhibitors, protozoan pheromone proteins, ConA like lectins/glucanases,
ovomucoid/PCI-1 like inhibitors, beta clip, snake toxin like and BPTI like. Other globular
proteins may be selected from the Protein Data Bank.

[0060] One of skill in the art also recognizes that the present invention is not
limited to small molecular proteins. A skilled artisan is cognizant that the computational
method used in the present invention can be used on larger proteins. Thus, there is not a size
limit to the proteins that can be used in the present invention.

[0061] Another embodiment of the present invention is a method of developing
a protein database comprising the steps of: inputting high resolution structures of proteins;
generating an ensemble of incrementally different conformations by combinatorial unfolding
of a set of predefined folding units in all possible combinations of each protein; determining
the probability of each said conformational state; calculating the residue-specific free energy
of each conformational state; and classifying a stability constant into at least one
thermodynamic environment selected from the group consisting of stability, enthalpy, and
entropy.

[0062] In specific embodiments, the generating step comprises dividing the
proteins into folding units by placing a block of windows over the entire sequence of the
protein and sliding the block of windows one residue at a time.

[0063] One of skill in the art is cognizant that the division of a protein into a
given number of folding units is a partition. Thus, to maximize the number of partially
folded states, different partitions are used in the analysis. The partitions can be defined by
placing a block of windows over the entire sequence of the protein. The folding units are
defined by the location of the windows irrespective of whether they coincide with specific
secondary structure elements. By sliding the entire block of windows one residue at a time,
different partitions of the protein are obtained. For two consecutive partitions, the first and
last amino acids of each folding unit are shifted by one residue. This procedure is repeated
until the entire set of partitions has been exhausted. In specific embodiments, windows of 5
or 8 amino acid residues are used. One of skill in the art realizes that approximately 10^
partially folded conformations can be generated using the COREX algorithm. This value can
be altered by increasing or decreasing the window size and the size of the protein. For
example, for the proteins λ6-85, chymotrypsin inhibitor 2 and barnase, window sizes of 5, 5,
and 8 amino acid residues result in 2.6 x 10^, 0.4 x 10^, and 1.1 x 10^ partially folded
conformations, respectively.
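
The sliding-window partitioning described in this paragraph can be sketched in Python. This is only an illustration under stated assumptions, not the COREX implementation: the function names (`partition`, `all_partitions`, `states_per_partition`) are hypothetical, and the treatment of the short leading and trailing units produced by a shift is an assumption of the sketch.

```python
from typing import List, Tuple

Unit = Tuple[int, int]  # inclusive, 1-based (first residue, last residue)


def partition(n_residues: int, window: int, offset: int = 0) -> List[Unit]:
    """Divide residues 1..n_residues into consecutive folding units.

    `offset` residues at the N-terminus form a short leading unit, which
    shifts every later boundary by the same amount. How short terminal
    units are handled is an assumption of this sketch.
    """
    if window < 1 or not 0 <= offset < window:
        raise ValueError("need window >= 1 and 0 <= offset < window")
    units: List[Unit] = []
    start = 1
    if offset:
        units.append((1, min(offset, n_residues)))
        start = offset + 1
    while start <= n_residues:
        end = min(start + window - 1, n_residues)
        units.append((start, end))
        start = end + 1
    return units


def all_partitions(n_residues: int, window: int) -> List[List[Unit]]:
    """One partition per one-residue shift of the block of windows."""
    return [partition(n_residues, window, k) for k in range(window)]


def states_per_partition(units: List[Unit]) -> int:
    """Each unit is either folded or unfolded: 2**len(units) combinations."""
    return 2 ** len(units)
```

For a hypothetical 12-residue chain and a window of five residues, `partition(12, 5)` gives the units 1-5, 6-10 and 11-12, and shifting by one residue gives 1, 2-6, 7-11 and 12, so the first and last residues of each unit move by one position between consecutive partitions.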

[0064] In further embodiments, the determining step comprises determining the
free energy of each of the conformational states in the ensemble; determining the Boltzmann
weight [$K_i = \exp(-\Delta G_i/RT)$] of each state; and determining the probability of each
state using the equation,

$$P_i = \frac{K_i}{\sum_i K_i}$$

[0065] Yet further, the calculating step comprises determining the energy
difference between all microscopic states in which a particular residue is folded and all such
states in which it is unfolded using the equation,

$$\Delta G_{f,j} = -RT \ln \kappa_{f,j}$$

[0066] One of skill in the art is aware that the COREX algorithm generates a
large number of partially folded states of a protein from the high resolution crystallographic
or NMR structure (Hilser & Freire, 1996; Hilser & Freire, 1997 and Hilser et al., 1997). In
this algorithm, the high resolution structure is used as a template to approximate the ensemble
of partially folded states of a protein. Thus, the protein is considered to be composed of
different folding units. The partially folded states are generated by folding and unfolding
these units in all possible combinations. There are two basic assumptions in the COREX
algorithm: (1) the folded regions in partially folded states are native-like; and (2) the unfolded
regions are assumed to be devoid of structure or lacking structure. Thermodynamic
quantities, e.g., ΔH, ΔS, ΔCp, and ΔG, the partition function and the probability of each state
($P_i$) are evaluated using an empirical parameterization of the energetics (Murphy & Freire,
1992; Gomez et al., 1995; Hilser et al., 1996; Lee et al., 1994; D'Aquino et al., 1996; and
Luque et al., 1996).

[0067] Yet further, a skilled artisan is cognizant that the residue-specific
equilibria provide quantitative agreement with those obtained experimentally from amide
hydrogen exchange experiments, e.g., hydrogen protection factors (Hilser & Freire, 1996;
Hilser & Freire, 1997; and Hilser et al., 1997).

[0068] One of skill in the art realizes that while the residue stability constants 
are purely thermodynamic quantities defined for all residues, the protection factors also 
contain non-thermodynamic contributions and are defined for a subset of residues. 

[0069] Another embodiment of the present invention is a method of identifying
a protein fold comprising determining the distribution of amino acid residues in different
thermodynamic environments corresponding to a known protein structure. More particularly,
determining the distribution of amino acid residues comprises constructing scoring matrices
derived from thermodynamic information. Specifically, the scoring matrices are derived from
COREX thermodynamic information, such as stability, enthalpy, and entropy. Thus,
COREX-derived thermodynamic descriptors can be used to identify sequences that
correspond to a specific fold.

[0070] A skilled artisan recognizes that the COREX algorithm provides a 
means of estimating the energetic variability in the native state of proteins, and uses this 
information to illuminate the relation between amino acid sequence and protein structure. 
Therefore, the thermodynamic information obtained by the COREX algorithm represents a 
fundamental descriptor of proteins that transcends secondary structure classifications. 

[0071] Protein folds can be considered as one of the most basic molecular parts.
A skilled artisan recognizes that the properties related to protein folds can be divided into two
parts, intrinsic and extrinsic. The intrinsic properties relate to an individual fold, e.g., its
sequence, three-dimensional structure and function. Extrinsic properties relate to a fold in
the context of all other folds, e.g., its occurrence in many genomes and expression level in
relation to that for other folds.

[0072] Further, one of skill in the art realizes that other methods well known in
the art can be used to develop protein databases, for example, but not limited to, the Monte
Carlo sampling method. The Monte Carlo sampling method is well known and used in the art
(Pan et al., 2000).

EXAMPLES



[0073] The following examples are included to demonstrate preferred 
embodiments of the invention. It should be appreciated by those skilled in the art that the 
techniques disclosed in the examples which follow represent techniques discovered by the 
inventor to function well in the practice of the invention, and thus can be considered to 
constitute preferred modes for its practice. However, those of skill in the art should, in light 
of the present disclosure, appreciate that many changes can be made in the specific 
embodiments which are disclosed and still obtain a like or similar result without departing 
from the concept, spirit and scope of the invention. 

Example 1 
Selection of proteins used in dataset 

[0074] A database of 44 proteins, 2922 residues total (Table 1), was selected 
from the Protein Data Bank on the basis of biological and computational criteria. The two 
biological criteria were that the proteins be globular and nonhomologous with every other 
member of the set as ascertained by SCOP (Murzin et al, 1995). The first computational 
criterion was that the proteins be small (less than about 90 residues), because the CPU time 
and data storage needs of an exhaustive COREX calculation increased exponentially with the 
chain length. The second computational criterion was that the structures be mostly devoid of 
ligands, metals, or cofactors, as the COREX energy function was not parameterized to 
account for the energetic contributions of non-protein atoms. The database was composed of
24 x-ray structures, whose resolution ranged from 2.60 to 1.00 Å (median value of 1.65 Å).
Twenty NMR structures completed the database. An independent database of 50 proteins
(3304 residues total) that were not included in the above set was created from the PDBSelect
database (Hobohm & Sander, 1996). This second database was used as a control to check the
results obtained from the first database, as shown in Figure 7.

Example 2
Computational Details 

[0075] The database of 44 nonhomologous proteins (Table 1) was analyzed 
using the COREX algorithm. The COREX algorithm (Hilser & Freire, 1996) was run with a 
window size of five residues on each protein in the database. The minimum window size was 
set to four, and the simulated temperature was 25 °C. 

[0076] Briefly, COREX generated an ensemble of partially unfolded
microstates using the high-resolution structure of each protein as a template (Hilser & Freire,
1996). This was facilitated by combinatorially unfolding a predefined set of folding units
(i.e., residues 1-5 are in the first folding unit, residues 6-10 are in the second folding unit,
etc.). By means of an incremental shift in the boundaries of the folding units, an exhaustive
enumeration of the partially unfolded species was achieved for a given folding unit size. The
entire procedure is shown schematically in Figure 1A for ovomucoid third domain (OM3),
one of the proteins in the database (PDB accession code 2ovo).

[0077] For each microstate i in the ensemble, the Gibbs free energy was
calculated from the surface area-based parameterization described previously (D'Aquino,
1996; Gomez, 1995; Xie, 1994; Baldwin, 1986; Lee, 1994; Habermann, 1996). The
Boltzmann weight of each microstate [$K_i = \exp(-\Delta G_i/RT)$] was used to calculate its
probability:

$$P_i = \frac{K_i}{\sum_i K_i} \qquad (1)$$

[0078] where the summation in the denominator is over all microstates. From
the probabilities calculated in Equation 1, an important statistical descriptor of the
equilibrium was evaluated for each residue in the protein. Defined as the residue stability
constant, $\kappa_{f,j}$, this quantity was the ratio of the summed probability of all states in the
ensemble in which a particular residue j was in a folded conformation ($\sum P_{f,j}$) to the
summed probability of all states in which j was in an unfolded conformation ($\sum P_{nf,j}$):

$$\kappa_{f,j} = \frac{\sum P_{f,j}}{\sum P_{nf,j}} \qquad (2)$$

[0079] From the stability constant, a residue-specific free energy was written
as:

$$\Delta G_{f,j} = -RT \ln \kappa_{f,j} \qquad (3)$$

[0080] Equation 3 reflects the energy difference between all microscopic
states in which a particular residue was folded and all such states in which it was unfolded.
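
The chain from microstate free energies to a residue-specific free energy (Equations 1 to 3) can be sketched in Python as follows. This is a minimal illustration, not the COREX code: the function names, the representation of each microstate as the set of residues that are folded in it, and the value used for the gas constant are assumptions of the sketch.

```python
import math
from typing import List, Sequence, Set

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K); assumed, not stated in the text


def state_probabilities(delta_g: Sequence[float], temp_k: float = 298.15) -> List[float]:
    """Equation 1: P_i = K_i / sum(K_i), with K_i = exp(-dG_i / RT)."""
    rt = R_KCAL * temp_k
    weights = [math.exp(-g / rt) for g in delta_g]
    total = sum(weights)
    return [w / total for w in weights]


def ln_kappa(probs: Sequence[float], folded: Sequence[Set[int]], residue: int) -> float:
    """Equation 2 in logarithmic form for one residue.

    folded[i] is the set of residues that are folded in microstate i.
    The residue must be unfolded in at least one state with nonzero probability.
    """
    p_f = sum(p for p, s in zip(probs, folded) if residue in s)
    p_nf = sum(p for p, s in zip(probs, folded) if residue not in s)
    return math.log(p_f / p_nf)


def residue_free_energy(ln_k: float, temp_k: float = 298.15) -> float:
    """Equation 3: dG_f,j = -RT ln(kappa_f,j), in kcal/mol."""
    return -R_KCAL * temp_k * ln_k
```

As a hypothetical usage, a two-residue toy ensemble with free energies `[0.0, 2.0, 3.0, 6.0]` kcal/mol and folded sets `[{1, 2}, {2}, {1}, set()]` gives a positive `ln_kappa` for residue 1, because the states in which it is folded carry most of the probability.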

[0081] The Gibbs energy for each microstate i relative to the fully folded
structure was calculated using Equation 4:

$$\Delta G_i = \Delta H_{i,solvation} - T(\Delta S_{i,solvation} + W \Delta S_{i,conformational}) \qquad (4)$$

[0082] where the calorimetric enthalpy and entropy of solvation were
parameterized from polar and apolar surface exposure, and the conformational entropy was
determined as described previously (Hilser & Freire, 1996). The maximum stability for each
protein was normalized to a common arbitrary value of approximately 6.2 kcal/mol
(max $\ln \kappa_f$ = 10.4) by adjusting its conformational entropy factor, W, in Equation 4.
The average entropy factor required for the normalization was 0.81 ± 0.19 (mean ± s.d.) over
the 44 proteins. It was an empirical observation that adjustment of a stable protein's
conformational entropy factor did not change the relative patterns of high and low stability
regions in the structure.
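
The two normalization values quoted in this paragraph are related through Equation 3. As a check, assuming a gas constant of 1.987 x 10^-3 kcal/(mol K), which the text does not state, and the simulated temperature of 25 °C (298.15 K) given in Example 2:

```latex
\begin{aligned}
\left|\Delta G_{f,j}\right|_{\max} &= RT \,\ln \kappa_{f,\max} \\
 &\approx (1.987\times10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}})\,(298.15\ \mathrm{K})\,(10.4) \\
 &\approx 6.16\ \mathrm{kcal\,mol^{-1}} \approx 6.2\ \mathrm{kcal\,mol^{-1}}
\end{aligned}
```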

Example 3 

Comparison of Residue Stability Constant to 
Hydrogen Exchange Protection Factors 

[0083] Prediction of the hydrogen exchange protection factors of the residues
that exchange protons was performed by calculation of the ensemble of $P_{f,j}$ and
$P_{f,xc,j}$ values.

[0084] Briefly, the protection factor for any given residue j was defined as the
ratio of the sum of the probabilities of the states in which residue j was closed, to the sum of
the probabilities of the states in which residue j was open:

$$PF_j = \frac{P_{closed,j}}{P_{open,j}} \qquad (5)$$

[0085] The statistical definition of the protection factors has the same form as
that of the stability constants (Equation 2) and was expressed in terms of the folding
probabilities as follows:

$$PF_j = \frac{\sum P_{f,j} - P_{f,xc,j}}{\sum P_{nf,j} + P_{f,xc,j}} \qquad (6)$$

[0086] The correction term $P_{f,xc,j}$ was the sum of the probabilities of all states in
which residue j was folded, yet exchange competent.
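
The relation between the protection factor and the stability constant can be illustrated with a short Python sketch. The function name and argument names are hypothetical; the arguments are the already-summed probabilities for one residue, and the numerical values in the usage note are invented for illustration only.

```python
def protection_factor(p_folded: float, p_unfolded: float, p_folded_xc: float = 0.0) -> float:
    """Protection factor of one residue from summed state probabilities.

    p_folded    : summed probability of states in which the residue is folded
    p_unfolded  : summed probability of states in which it is unfolded
    p_folded_xc : summed probability of states in which it is folded yet
                  exchange competent (the correction term)
    """
    if not 0.0 <= p_folded_xc <= p_folded:
        raise ValueError("correction term must lie between 0 and p_folded")
    p_open = p_unfolded + p_folded_xc
    if p_open <= 0.0:
        raise ValueError("residue is never open; protection factor is undefined")
    return (p_folded - p_folded_xc) / p_open
```

With the correction term set to zero the function returns the residue stability constant itself, e.g. `protection_factor(0.99, 0.01)` is 99. Moving a hypothetical 0.09 of the folded probability into the exchange-competent class lowers the result to 9, which shows how strongly the correction term can reduce the predicted protection.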

[0087] Figure 2 shows the comparison of hydrogen exchange protection factors 
predicted from COREX data with experimental values for OM3. The agreement in the 
location and relative magnitude of the protection factors with the stability constants for this 
and other proteins suggested that the calculated native state ensemble provided a good 
description of the actual ensemble (Hilser & Freire, 1996). It naturally follows that the 
residue stability constants of a particular protein provided a good description of the 
thermodynamic environment of each residue in that structure. 

[0088] Further inspection of Figure 2 revealed another important feature in the
pattern of residue stability constants. Namely, the stability constants varied significantly
across a given secondary structural element, as observed for alpha helix 1 of OM3. The
protection factors (and stability constants) were high at the N-terminal region of helix 1, but
decreased over the length of the helix. This indicated that secondary structure, or other
structural classifications, do not obligatorily coincide with thermodynamic classifications.
This result has potentially important consequences for cataloging propensities of amino acids
in different environments. For example, in OM3 two threonine residues were located in
different structural environments; Thr 47 was part of the loop that follows alpha helix 1,
while Thr 49 was part of beta strand 3. In spite of the different structural environments for
the two threonine residues, the stability constants and, more importantly, the experimental
protection factors demonstrated that both residues, to a first approximation, share the same
thermodynamic environment.

Example 4 
Binning of Residue Stability Constants 

[0089] Inspection of each protein's $\ln \kappa_f$ data indicated that there were three
stability classes: high, medium, and low stability. The cutoffs for each stability class were
adjusted so that an approximately equal number of residues in the database fell in each class
(Table 2). The low stability category was defined as $\ln \kappa_f \le 3.99$, the medium
stability category was defined as $3.99 < \ln \kappa_f \le 7.14$, and the high stability category
was defined as $\ln \kappa_f > 7.14$. Statistics of amino acid type as a function of each of these
stability categories were tabulated (Table 2), and normalized histograms of these numbers are
shown in Figure 3A-Figure 3T.
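
The binning rule can be written as a small Python sketch using the cutoffs quoted above. The function names and the input format (pairs of residue type and ln κ_f value) are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LOW_MAX = 3.99     # ln(kappa_f) <= 3.99 is the low stability class
MEDIUM_MAX = 7.14  # 3.99 < ln(kappa_f) <= 7.14 is the medium stability class


def stability_class(ln_kappa_f: float) -> str:
    """Assign one residue to the low, medium or high stability class."""
    if ln_kappa_f <= LOW_MAX:
        return "low"
    if ln_kappa_f <= MEDIUM_MAX:
        return "medium"
    return "high"


def tabulate(residues: Iterable[Tuple[str, float]]) -> Dict[str, Counter]:
    """Count, for each amino acid type, how many residues fall in each class."""
    table: Dict[str, Counter] = {}
    for amino_acid, value in residues:
        table.setdefault(amino_acid, Counter())[stability_class(value)] += 1
    return table
```

Applied to every residue in the database, `tabulate` would produce counts of the kind reported in Table 2, one row per amino acid type and one column per stability class.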





Table 2. Statistics of $\ln \kappa_f$ Values for 2922 Residues in the Database^a

Residue          Low    Medium    High    Total
...              ...       ...     ...      41
...               46        41      58     145
...               70        51      32     ...
Trp               10         5      22      37
Tyr               15        27      50      92
Val               48        79      71     198
Column Total     971       971     980    2922

^a The values in this table were used to compute the normalized histograms shown in Figure
3A-Figure 3T. In addition, these values (minus the values for a given target protein) were
used to compute the $\ln \kappa_f$ scoring matrices.



[0090] Striking asymmetries were often observed for the histograms of certain
amino acids across the three stability environments, and these asymmetries were well outside
the standard deviation of the average of three random data sets. For example, the aromatic
amino acids Phe, Trp, and Tyr were mostly found in high stability environments, while Gly
and Pro were overwhelmingly found in low stability environments. In contrast, other
residues such as Ala, Met, and Ser exhibited distributions that did not significantly differ
from randomized data.



[0091] Although the acidic residues Asp and Glu shared a slight tendency to be
found in medium stability environments, it was observed that several amino acid pairs having
nominally similar chemical characteristics partition differently in the stability environments.
For example, the basic residues Arg and Lys exhibited opposite stability characteristics: the
counts for Arg increased as the stability class increased, but the counts for Lys decreased as a
function of stability class. While Asn was found less often in high stability environments,
Gln was found more often in them. Although the distribution for Ser did not differ
significantly from the randomized data, Thr occurred more often in low stability
environments and less often in high stability environments. Somewhat surprisingly, the
aliphatic amino acids Ile, Leu, and Val did not show a general pattern, except perhaps a slight
disfavoring of low stability environments.

Example 5 

Calculation of Average Native State Side Chain Surface Area Exposure

[0092] Average side chain surface area exposure of residue j over a
window size of five residues, $ASA_{average,j}$, was calculated using Equation 7:

$$ASA_{average,j} = \frac{\sum_{i=j-2}^{j+2} ASA_{side\,chain,i}}{5} \qquad (7)$$

[0093] Because Equation 7 was undefined for the first and last two residues in
each protein, these four residues were ignored in the binning. The cutoffs for each side chain
area class were adjusted so that an approximately equal number of residues fell in each class.
The low exposure category was defined as $ASA_{average,j} \le 43.31$ Å^2, the medium exposure
category was defined as $43.31 < ASA_{average,j} \le 59.86$ Å^2, and the high exposure category
was defined as $ASA_{average,j} > 59.86$ Å^2.
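
The window averaging and exposure binning can be sketched in Python. The sketch assumes that the five-residue window is centered on residue j, which is consistent with the average being undefined for the first two and last two residues. The function names are hypothetical and the input is taken to be a list of per-residue side chain surface areas in Å^2.

```python
from typing import List, Optional, Sequence


def window_average(asa: Sequence[float], half_width: int = 2) -> List[Optional[float]]:
    """Average side chain area over a centered window of 2*half_width + 1 residues.

    Positions without a complete window (the first and last `half_width`
    residues) are returned as None and are skipped in the binning.
    """
    n = len(asa)
    averaged: List[Optional[float]] = [None] * n
    for j in range(half_width, n - half_width):
        window = asa[j - half_width : j + half_width + 1]
        averaged[j] = sum(window) / len(window)
    return averaged


def exposure_class(asa_average: float) -> str:
    """Bin a window-averaged area (in square angstroms) with the quoted cutoffs."""
    if asa_average <= 43.31:
        return "low"
    if asa_average <= 59.86:
        return "medium"
    return "high"
```

For a hypothetical six-residue chain with areas `[10, 20, 30, 40, 50, 60]`, only the third and fourth residues have a complete window, with averages of 30.0 and 40.0 Å^2, and both fall in the low exposure class.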

[0094] As shown in Figure 4, frequencies of amino acids found in COREX
stability environments were not correlated to frequencies of amino acids in exposed surface
area environments. This was important as it suggested that the thermodynamic information
calculated by the COREX algorithm was not simply monitoring a static property of the
structure, but instead was capturing a property of the native state ensemble as a whole.

Example 6
Random Data Sets

[0095] For comparison to the COREX and DSSP data sets from the 44
nonhomologous proteins in the database, control data sets were constructed by randomizing
(i.e., shuffling) the calculated stability and the secondary structure data. The random data sets
therefore contained the same amino acid composition, counts of high, medium, and low
stabilities, and types of secondary structure, as the real data sets. However, any correlation
between residue type and stability or secondary structural class was presumably destroyed by
randomization. To assess internal variability of the data due to differing numbers of counts
of each residue type, the results from three randomized data sets were averaged and standard
deviations calculated; these data are plotted in Figure 3A-Figure 3T.
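
A shuffling control of this kind can be sketched in Python. The sketch keeps the residue types in place and permutes the class labels, so the composition and the class totals are unchanged. The function names, the seeds, and the use of the sample standard deviation are assumptions made for illustration.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def shuffled(labels: Sequence[str], seed: int) -> List[str]:
    """Return a reproducibly shuffled copy of the class labels."""
    rng = random.Random(seed)
    out = list(labels)
    rng.shuffle(out)
    return out


def randomized_count(
    residues: Sequence[str],
    labels: Sequence[str],
    residue: str,
    label: str,
    seeds: Sequence[int] = (0, 1, 2),
) -> Tuple[float, float]:
    """Mean and standard deviation, over several shuffles, of the number of
    times `residue` is paired with `label`."""
    counts = []
    for seed in seeds:
        random_labels = shuffled(labels, seed)
        counts.append(
            sum(1 for r, l in zip(residues, random_labels) if r == residue and l == label)
        )
    return statistics.mean(counts), statistics.stdev(counts)
```

Comparing the real count for a residue type and class against the mean and standard deviation returned by `randomized_count` indicates whether an observed asymmetry lies outside the spread expected by chance.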

Example 7 
Construction of Scoring Matrices 

[0096] The scoring matrices were calculated as log-odds probabilities of finding
residue type j in structural environment k, as described below and in (Bowie et al., 1991).
The matrix score, $S_{j,k}$, was defined as:

$$S_{j,k} = \ln \frac{P_{j|k}}{P_k} \qquad (8)$$

[0097] In Equation 8, $P_{j|k}$ was the probability of finding a residue of type j in
stability class k (i.e., number of counts of residue type j in stability class k divided by the total
number of counts of residue type j), and $P_k$ was the probability of finding any residue in the
database in stability environment k (i.e., number of residues in stability class k, regardless of
amino acid type, divided by the total number of residues in the entire database, regardless of
amino acid type). The structural environment was described by either COREX stability
information (high, medium, or low $\ln \kappa_f$), or DSSP secondary structure (alpha, beta, or
other) as given in the target's PDB entry. The fold recognition target was removed from the
database, and the remaining 43 proteins were used to calculate the scores; therefore,
information about the target was never included in the scoring matrix. The values in Tables
3A and 3B are the average ± standard deviation of all 44 individual scoring matrices.

[0098] The scoring matrices derived from COREX stability and secondary
structure, averaged over all 44 target proteins, are shown in Tables 3A and 3B, respectively.
The stability matrix scores faithfully reflected the histograms shown in Figure 3A-Figure 3T;
for example, Gly and Pro scored unfavorably in high stability environments but scored
favorably in low stability environments. Similarly, the secondary structure matrix scores
followed intuitive notions of secondary structure propensity; for example, Ala scored
positively in helical environments, the aromatics scored positively in beta environments, and
Gly and Pro scored negatively in both alpha and beta environments. The standard deviations
in both matrices were generally small as compared to the magnitude of the scores, suggesting
that the scores were not affected by the removal of any one protein from the database.
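
The log-odds score of Equation 8 can be computed directly from counts, as in the following Python sketch. The function name is hypothetical, and the refusal to handle zero counts is a simplification of the sketch rather than something described in the text.

```python
import math


def log_odds_score(n_jk: int, n_j: int, n_k: int, n_total: int) -> float:
    """Score for residue type j in environment k, from raw counts.

    n_jk    : residues of type j found in environment k
    n_j     : all residues of type j
    n_k     : all residues in environment k, regardless of type
    n_total : all residues in the database
    """
    if min(n_jk, n_j, n_k, n_total) <= 0:
        raise ValueError("all counts must be positive in this sketch")
    p_j_given_k = n_jk / n_j  # as defined in the text for Equation 8
    p_k = n_k / n_total
    return math.log(p_j_given_k / p_k)
```

As an illustration, reading the Trp row of Table 2 as 10 low, 5 medium and 22 high stability counts out of 37, with 980 of the 2922 residues in the high class, `log_odds_score(22, 37, 980, 2922)` is about 0.57, and the corresponding low stability score is negative. These figures use the whole database rather than the leave-one-out sets, so they are not expected to match any entry of Table 3A exactly.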

Example 8 
Fold-Recognition Details 

[0099] Fold-recognition experiments were based on the profile method
pioneered by Eisenberg and co-workers (Gribskov et al., 1987; Bowie et al., 1991).

[0100] Briefly, the method characterized each residue position of a target
protein in terms of a structural environment score derived from analysis of a database of
known structures. The resulting profile of the target protein was then optimally aligned to
each member of a library of amino acid sequences by maximizing the score between the
sequence and the profile. Two structural environment scoring schemes were developed: one
based on calculated COREX stability, and one based on DSSP secondary structure (Kabsch
& Sander, 1983) as contained in each target protein's PDB file. Each scoring scheme had
three dimensions as a function of the 20 amino acids: high, medium, and low stability for
COREX scoring, or alpha, beta, and other for secondary structure scoring. Two alignment
algorithms were used: a local scheme (Smith & Waterman, 1981) as implemented in the
PROFILESEARCH software package (Bowie et al., 1991), and a global scheme. The global
alignment scheme simply paired the first residue of an amino acid sequence with the first
position of a target profile, with no allowance for gaps. This scheme was possible because
the amino acid sequence lists against which the targets were threaded only included
sequences of identical length to each target corresponding to monomeric structures from the
PDB. The total number of identical length sequences for each target ranged from 6 to 35,
with an average of 19 ± 8 sequences per target (Table 1). No attempt was made to optimize
the gap opening and extension penalties for the local algorithm; in all cases these were the
defaults given in the PROFILESEARCH package, 0.1 and 0.05, respectively.
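
The global scheme, in which the first residue is paired with the first profile position and no gaps are allowed, reduces to a position-by-position sum. A minimal Python sketch follows. The function names are hypothetical, the decoy sequences in the usage note are invented, and the small example matrix holds only the five high-stability scores listed for residues 1-5 of the 1vcc alignment in Table 4B.

```python
from typing import Dict, Sequence, Tuple

Matrix = Dict[Tuple[str, str], float]  # (residue type, environment) -> score

# Scores for M, R, A, L and F in the high stability environment (Table 4B, residues 1-5).
EXAMPLE_MATRIX: Matrix = {
    ("M", "H"): -0.08,
    ("R", "H"): 0.30,
    ("A", "H"): -0.01,
    ("L", "H"): 0.19,
    ("F", "H"): 0.66,
}


def global_score(sequence: str, profile: Sequence[str], matrix: Matrix) -> float:
    """Gapless score of a sequence threaded onto a profile of equal length."""
    if len(sequence) != len(profile):
        raise ValueError("the global scheme requires identical lengths")
    return sum(matrix[(aa, env)] for aa, env in zip(sequence, profile))


def rank_of(target: str, library: Sequence[str], profile: Sequence[str], matrix: Matrix) -> int:
    """Rank of the target sequence among the library (1 = best score)."""
    target_score = global_score(target, profile, matrix)
    better = sum(
        1 for s in library
        if s != target and global_score(s, profile, matrix) > target_score
    )
    return better + 1
```

Threading `"MRALF"` onto a profile of five high-stability positions gives a score of 1.06, and against the invented decoys `"AAAAA"` and `"MMMMM"` the target ranks first. A fold-recognition experiment of the kind described repeats this ranking for each target against its list of identical-length sequences.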

[0101] The results of the fold recognition experiments are shown in Figure 5A,
Figure 5B, Figure 5C and Figure 5D, and at least three conclusions are drawn from these data.
First, scoring matrices composed of either COREX stability or DSSP secondary structure
data performed better than randomized data sets in matching a structural target to its amino
acid sequence. In Figure 5A, Figure 5B, Figure 5C and Figure 5D, the results for COREX
data are stacked toward the left (successful) side of the rankings, while the randomized data
approaches a bell-shaped distribution with a maximum near the median of the size of the
sequence datasets (approximately 10 for the mean size of 19 sequences). Second, for both
COREX and DSSP scoring matrices, the global algorithm (which took the entire amino acid
sequence into account) performed significantly better than the local algorithm (which
generally aligned only a subset of the sequence). Third, the total number of targets falling in
the most successful bin was similar for both the COREX stability and secondary structure
matrices, suggesting that COREX stability propensities alone contained a comparable amount
of information to secondary structure propensities.

[0102] Because the local alignment algorithms used here compute a score
without returning the complete alignment of profile to sequence, high scores may have been
possible from non-structurally significant local alignments. In other words, it is possible that
a correct sequence may have scored well against its corresponding target structure without
having placed the individual amino acids in their correct positions within the structure. The
use of the global alignment in conjunction with amino acid sequences of identical length
partially alleviated this problem, as no misalignment was allowed in the global scheme.

Example 9 

Successful Alignment Based on COREX Stability 

[0103] To assess the extent of local alignments that were structurally
significant, minor modifications were made to the PROFILESEARCH source code that saved
the traceback of the alignment matrix. It was found that for targets scoring poorly in the
fold-recognition rankings, local alignments of the corresponding sequence were often not
significant. However, sequences that scored in the top two bins were often found to be
completely and correctly aligned with their target profiles, even though not all of their
residues contributed to the overall score due to the rules of the local algorithm. Three
examples of successful alignment based on COREX stability data alone are shown in Figures
6A, 6B, 6C and Tables 4A, 4B, 4C for the targets Protein G (1igd), DNA topoisomerase I
(1vcc), and tendamistat (2ait), respectively. The alignments calculated using the local
algorithm were correct, despite the fact that no sequence information about the target was
used, and that only a subset of the amino acid sequence was used in the scoring. In addition,
it is noteworthy that the success of these examples is not due to merely a small fragment of
the sequence, as the cumulative 3D-1D matrix score steadily increases over the entire length
of the sequence.
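
The cumulative score referred to here is a running sum of the per-position matrix scores. A short Python sketch, using the scores listed for residues 1-10 of the 1vcc alignment in Table 4B, shows how such a column is built. The function name is hypothetical, and rounding to two decimals is done only to match the precision of the table.

```python
from itertools import accumulate
from typing import List, Sequence

# 3D-1D matrix scores for residues 1-10 of the 1vcc alignment, as listed in Table 4B.
SCORES_1VCC_1_TO_10 = [-0.08, 0.30, -0.01, 0.19, 0.66, -0.14, 0.19, -0.25, 0.53, 0.19]


def cumulative_scores(position_scores: Sequence[float]) -> List[float]:
    """Running sum of per-position scores, rounded to two decimals."""
    return [round(total, 2) for total in accumulate(position_scores)]
```

Calling `cumulative_scores(SCORES_1VCC_1_TO_10)` reproduces the first ten entries of the cumulative column of Table 4B, rising from -0.08 to 1.58 with only small local dips.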



Table 4A. Local Alignment Score of 1igd Sequence to 1igd Stability Profile

Residue    Residue    Stability        3D-1D Matrix    Cumulative Local
Number     Type^a     Environment^b    Score^c         Alignment Score^d
23                                                      3.67
24         K          L                 0.22            3.89
25                                      0.05            3.94
26                                     -0.33            3.61
27                                     -0.20            3.41
28                                     -0.03            3.38
29                                      0.15            3.53
30                                      0.05            3.58
31                                     -0.02            3.56
32                                     -0.08            3.48
33                                     -0.29            3.19
34                                     -0.02            3.17
35                                      0.64            3.81
36                                     -0.29            3.52
37                                      0.34            3.86
38                                      0.48            4.34
39                                     -0.02            4.32
40                                     -0.25            4.07
41                                      0.26            4.33
42                                      0.10            4.43
43                                     -0.05            4.38
44                                      0.17            4.55
45                                      0.26            4.81
46                                     -0.05            4.76
47                                      0.17            4.93
48                                      0.55            5.48
49                                     -0.52            4.96
50                                      0.48            5.44
51                                      0.26            5.70
52                                      0.26            5.96
53                                     -0.03            5.93
54                                      0.05            5.98
55                                      0.00            5.98
56                                     -0.52            5.46
57                                      0.64            6.10
58                                     -0.52            5.58
59                                      0.08            5.66
60                                     -0.52            5.14
61                                     -0.08            5.06

Residue Type^a (residues 25-61, in order, incomplete): V, D, K, Q, Y, A, N, D, N, D, V, W, T,
D, T, X, V
Stability Environment^b (residues 25-61, in order, incomplete): M, M, M, H, H, H, H, H, H, H,
H, H, M, M, M, M, M, M, M, M, H, H, H, M, M, M, M, M, H, H, H, H, H

^a One of skill in the art recognizes that the residue types are listed by the one-letter amino
acid code.

^b H, M, and L denote high, medium, and low stability as defined in the text and in footnote b

^c Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
1igd amino acid sequence given in the "Residue Type" column to the 1igd stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

^d Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.
Data in the "Cumulative Local Ahgmnent Score" column was used to generate Figure 5 A. 
Table 4B. Local Alignment Score of 1vcc Sequence to 1vcc Stability Profile

Residue  Residue  Stability       3D-1D Matrix  Cumulative Local
Number   Type(a)  Environment(b)  Score(c)      Alignment Score(d,e)
1        M        H               -0.08         -0.08
2        R        H               0.30          0.22
3        A        H               -0.01         0.21
4        L        H               0.19          0.40
5        F        H               0.66          1.06
6        Y        M               -0.14         0.92
7        K        L               0.19          1.11
8        D        L               -0.25         0.86
9        G        L               0.53          1.39
10       K        L               0.19          1.58
11       L        M               -0.04         1.54
12       F        H               0.66          2.20
13       T        M               0.00          2.20
14       D        M               0.28          2.48
15       N        M               0.06          2.54
16       N        M               0.06          2.60
17       F        M               -0.36         2.24
18       L        M               -0.04         2.20
19       N        M               0.06          2.26
20       P        M               -0.11         2.15
21       V        M               0.19          2.34
22       S        M               -0.19         2.15
23       D        M               0.28          2.43
24       D        M               0.28          2.71
25       N        M               0.06          2.77
26       P        M               -0.11         2.66
27       A        M               -0.04         2.62
28       Y        H               0.50          3.12
29       E        M               -0.10         3.02
30       V        M               0.19          3.21
31       L        M               -0.04         3.17
32       Q        M               -0.04         3.13
33       H        L               0.22          3.35
34       V        L               -0.32         3.03
35       K        L               0.19          3.22
36       I        L               -0.31         2.91
37       P        L               0.47          3.38
38       T        L               0.32          3.70
39       H?       L?              0.22          3.92
40       ?        L?              -0.19         3.73
41       T?       L?              0.32          4.05
42       D?       L?              -0.25         3.80
43       V        M               0.19          3.99
44       V        H               0.06          4.05
45       V        H               0.06          4.11
46       Y?       H?              0.50          4.61
47       ?        H               -0.10         4.51
48       ?        H               0.34          4.85
49       ?        H               -0.47         4.38
50       W        H               0.55          4.93
51       ?        H               -0.10         4.83
52       ?        M               0.15          4.98
53       A        M               -0.04         4.94
54       L?       M               -0.04         4.90
55       T?       M               0.00          4.90
56       K        M?              -0.06         4.84
57       T?       H               0.19          5.03
58       ?        H?              0.10          5.13
59       F?       H?              0.66          5.79
60       V        H?              0.06          5.85
61       ?        H               -1.11         4.74
62       C        M?              -0.19         4.55
63       D?       L?              -0.25         4.30
64       ?        L?              -0.05         4.25
65       ?        L?              0.19          4.44
66       ?        L?              0.53          4.97
67       R?       L               -0.34         4.63
68       R        H               0.30          4.93
69       Q        H               0.34          5.27
70       Y        M               -0.14         5.13
71       F        M               -0.36         4.77
72       Y        L               -0.73         4.04
73       G        L               0.53          4.57
74       K        L               0.19          4.76
75       M        L               0.04          4.80
76       H        L               0.22          5.02
77       V        L               -0.32         4.70

A "?" marks a cell that is not legible in this copy; a letter followed by "?" is the best
reading of a damaged character.

(a) One of skill in the art recognizes that the Residue types are listed by the one letter amino
acid designation.

(b) H, M, and L denote high, medium, and low stability as defined in the text and in footnote b
of Table 3.

(c) Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
1vcc amino acid sequence given in the "Residue Type" column to the 1igd stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

(d) Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.

(e) Data in the "Cumulative Local Alignment Score" column was used to generate Figure 5B.
Table 4C. Local Alignment Score of 2ait Sequence to 2ait Stability Profile

Residue  Residue  Stability       3D-1D Matrix  Cumulative Local
Number   Type(a)  Environment(b)  Score(c)      Alignment Score(d,e)
1        N        L               -0.21         -0.21
2        T        L               0.31          0.10
3        T        L               0.31          0.41
4        V        L               -0.30         0.11
5        S        L               -0.06         0.05
6        E        L               -0.11         -0.06
7        P        L               0.47          0.41
8        A        M               -0.04         0.37
9        P        M               -0.10         0.27
10       S        M               -0.14         0.13
11       C        M               -0.19         -0.06
12       V        M               0.18          0.12
13       T        M               -0.02         0.10
14       L        M               -0.02         0.08
15       Y        H               0.44          0.52
16       Q        H               0.34          0.86
17       S        H               0.18          1.04
18       W        H               0.55          1.59
19       R        H               0.27          1.86
20       Y        H               0.44          2.30
21       S        H               0.18          2.48
22       Q        H               0.34          2.82
23       A        H               -0.02         2.80
24       D        H               -0.14         2.66
25       N        M               0.11          2.77
26       G        L               0.53          3.30
27       C        L               -0.11         3.19
28       A        L               0.05          3.24
29       E        L               -0.11         3.13
30       T        L               0.31          3.44
31       V        M               0.18          3.62
32       T        M               -0.02         3.60
33       V        H               0.06          3.66
34       K        H               -0.28         3.38
35       V        H               0.06          3.44
36       V        H               0.06          3.50
37       Y        H               0.44          3.94
38       E        M               0.14          4.08
39       D        M               0.28          4.36
40       ?        M               0.28          4.64
41       T        M               -0.02         4.62
42       ?        M               0.14          4.76
43       G        M               -0.04         4.72
44       L        M               -0.02         4.70
45       C        M               -0.19         4.51
46       Y        H               0.44          4.95
47       A        M               -0.04         4.91
48       V        M               0.18          5.09
49       A        M               -0.04         5.05
50       P        M               -0.10         4.95
51       G        L               0.53          5.48
52       ?        M               -0.04         5.44
53       T?       L               -0.34         5.10
54       T        L               0.31          5.41
55       T        M               -0.02         5.39
56       V        M               0.18          5.57
57       G        M               -0.04         5.53
58       D?       M               0.28          5.81
59       ?        M               -0.04         5.77
60       Y        M               -0.09         5.68
61       T?       L               -0.34         5.34
62       G?       L               0.53          5.87
63       ?        L               -0.06         5.81
64       H        L               0.30          6.11
65       G        L               0.53          6.64
66       H?       M               -0.43         6.21
67       A        H               -0.02         6.19
68       R?       H               0.27          6.46
69       Y        H               0.44          6.90
70       L        H               0.18          7.08
71       A        H               -0.02         7.06
72       R        H               0.27          7.33
73       C        H               0.24          7.57
74       L        H               0.18          7.75

A "?" marks a cell that is not legible in this copy; a letter followed by "?" is the best
reading of a damaged character.

(a) One of skill in the art recognizes that the Residue types are listed by the one letter amino
acid designation.

(b) H, M, and L denote high, medium, and low stability as defined in the text and in footnote b
of Table 3.

(c) Value of the 3D-1D scoring matrix corresponding to the results of optimal alignment of the
2ait amino acid sequence given in the "Residue Type" column to the 1igd stability profile
given in the "Stability Environment" column. These values are highly similar, but not
identical, to the average values given in Table 3A because these values are from the scoring
matrix produced when the target protein was removed from the database, as described in the
text.

(d) Sum of all the values in the "3D-1D Matrix Score" column up to and including the
indicated residue number. Values in boldface were used by the local alignment algorithm
(Smith & Waterman, 1981) to compute the optimal sequence to profile alignment.

(e) Data in the "Cumulative Local Alignment Score" column was used to generate Figure 5C.
Example 10
State of Ensemble Using COREX 

[0104] A database of 81 proteins, 5849 residues total (Table 5), was selected 
from the Protein Data Bank (Baldwin and Rose, 1999) on the basis of biological and 
computational criteria as described previously in Example 1. 

[0105] Next, the COREX algorithm (Hilser & Freire, 1996) was run with a
window size of five residues on each protein in the database. The minimum window size was
set to four, and the simulated temperature was 25 °C. The COREX algorithm generated an
ensemble of partially unfolded microstates using the high-resolution structure of each protein
as a template (Hilser & Freire, 1996), similar to Example 2. This was facilitated by
combinatorially unfolding a predefined set of folding units (i.e., residues 1-5 are in the first
folding unit, residues 6-10 are in the second folding unit, etc.). By means of an incremental
shift in the boundaries of the folding units, an exhaustive enumeration of the partially
unfolded species was achieved for a given folding unit size (Hilser & Freire, 1996; Wrabl et
al., 2001).
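The enumeration described in this paragraph can be illustrated with a short Python sketch. It is a simplified illustration, not the COREX implementation: the function names are hypothetical, and the rule used here for units shorter than the minimum window size (merging them into a neighboring unit) is an assumption.

```python
from itertools import product


def folding_units(n_residues, window, offset, min_window):
    """Split residues 0..n-1 into folding units for one boundary shift.

    Units shorter than min_window are merged into a neighbor
    (an assumed rule, used here only for illustration).
    """
    starts = list(range(offset, n_residues, window))
    if offset > 0:
        starts = [0] + starts
    cuts = starts + [n_residues]
    units = []
    for start, end in zip(cuts, cuts[1:]):
        too_short = end - start < min_window
        prev_short = bool(units) and units[-1][1] - units[-1][0] < min_window
        if units and (too_short or prev_short):
            units[-1] = (units[-1][0], end)  # extend the previous unit
        else:
            units.append((start, end))
    return units


def enumerate_states(n_residues, window=5, min_window=4):
    """Return the set of microstates, each a frozenset of folded residues."""
    states = set()
    for offset in range(window):  # incremental shift of the unit boundaries
        units = folding_units(n_residues, window, offset, min_window)
        # Every unit is either folded (True) or unfolded (False).
        for flags in product((True, False), repeat=len(units)):
            folded = frozenset(
                r for (a, b), f in zip(units, flags) if f for r in range(a, b)
            )
            states.add(folded)
    return states


if __name__ == "__main__":
    print(len(enumerate_states(12)))
```

Each shift of the boundaries contributes partially unfolded species that the other shifts cannot produce, which is why the union over all shifts is taken.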

[0106] Next, the Gibbs free energy for each state, $\Delta G_i$, relative to the fully
folded reference state was calculated from surface area- and conformational entropy-based
parameterizations described previously in Example 2 (Wrabl et al., 2001). Thus, the $\Delta G_i$ of
each state arises from differences in solvation of apolar and polar surface area, and from
differences in conformational entropy between each state and the reference state. Therefore,
dividing the free energy into its component terms gives:

$$\Delta G_i = \Delta G_{i,\mathrm{apolar}} + \Delta G_{i,\mathrm{polar}} + \Delta G_{i,\mathrm{conf}} \qquad (9)$$

[0107] As Equation 9 indicates, different values for the component
contributions can provide similar magnitudes for $\Delta G_i$, suggesting that different states can
have similar stabilities, but different mechanisms for achieving that stability.
Example 11
Surface Area Calculations 



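[0108] The calorimetric enthalpy and entropy of solvation were parameterized
from polar and apolar surface exposure (Hilser & Freire, 1996). COREX uses empirical
parameterizations to calculate the relative apolar and polar free energies of each microstate:

$$\Delta G_{i,\mathrm{apolar}}(T) = -8.44\,\Delta ASA_{i,\mathrm{apolar}} + 0.45\,\Delta ASA_{i,\mathrm{apolar}}\,(T - 333) - T\left(0.45\,\Delta ASA_{i,\mathrm{apolar}}\,\ln(T/385)\right) \qquad (10)$$

$$\Delta G_{i,\mathrm{polar}}(T) = 31.4\,\Delta ASA_{i,\mathrm{polar}} - 0.26\,\Delta ASA_{i,\mathrm{polar}}\,(T - 333) - T\left(-0.26\,\Delta ASA_{i,\mathrm{polar}}\,\ln(T/335)\right) \qquad (11)$$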
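The two parameterizations can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, temperatures are assumed to be in kelvin and surface areas in square angstroms, and the polar coefficients (31.4 and a reference temperature of 335) follow the published COREX parameterization and should be checked against Hilser & Freire (1996).

```python
import math


def dg_apolar(dasa_apolar, temperature):
    """Apolar solvation free energy from the change in apolar surface area."""
    return (
        -8.44 * dasa_apolar
        + 0.45 * dasa_apolar * (temperature - 333.0)
        - temperature * (0.45 * dasa_apolar * math.log(temperature / 385.0))
    )


def dg_polar(dasa_polar, temperature):
    """Polar solvation free energy (coefficients 31.4 and 335 are assumed)."""
    return (
        31.4 * dasa_polar
        - 0.26 * dasa_polar * (temperature - 333.0)
        - temperature * (-0.26 * dasa_polar * math.log(temperature / 335.0))
    )


if __name__ == "__main__":
    t = 298.15  # 25 degrees C, the simulated temperature used in Example 10
    print(dg_apolar(100.0, t), dg_polar(100.0, t))
```

Both terms are linear in the exposed area, so the free energy of a microstate scales with how much apolar and polar surface it exposes relative to the reference state.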

[0109] The three primary components used to calculate conformational
entropies ($\Delta S_{i,\mathrm{conf}}$) for each microstate were: (1) $\Delta S_{bu \to ex}$, the entropy change associated with
the transfer of a side-chain that was buried in the interior of the protein to its surface; (2)
$\Delta S_{ex \to u}$, the entropy change gained by a surface-exposed side-chain when the peptide
backbone unfolds; and (3) $\Delta S_{bb}$, the entropy change gained by the backbone itself upon
unfolding (Hilser & Freire, 1996). For fold recognition calculations, the total ($\Delta S_{i,\mathrm{conf}}$) of all
proteins is multiplied by a scaling factor to eliminate the unfolded state contribution to the
residue-specific thermodynamic parameters.

[0110] Next, the residue stability constant, $\kappa_{f,j}$, was calculated similar to
Example 2. The residue stability constant is the ratio of the summed probability of all states
in the ensemble in which a particular residue, j, is in a folded conformation ($\sum P_{f,j}$) to the
summed probability of all states in which residue j is in an unfolded (i.e., non-folded)
conformation ($\sum P_{nf,j}$).

[0111] Equation 2, in turn, was used to define a residue-specific free energy of
folding for the protein ($\Delta G_{f,j} = -RT \ln \kappa_{f,j}$), which was expanded to give
($\Delta G_{f,j} = RT \ln Q_{nf,j} - RT \ln Q_{f,j}$), where $Q_{nf,j}$ and $Q_{f,j}$ were the sub-partition functions for
states in which residue j was unfolded and folded, respectively. Thus, the residue-specific
free energy provides the difference in energy between the sub-ensembles in which each
residue is folded and unfolded. In other words, the residue stability constant does not provide
the contribution of each amino acid to the stability of a protein. Rather, it provides the
relative stability of that region of the protein, implicitly considering the contribution of all
amino acids in the protein toward the observed stability at that position.
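As an illustration of the residue stability constant and the residue-specific free energy, the following Python sketch computes both from a toy ensemble. The function name, the data layout (a list of folded-residue sets with their free energies in cal/mol) and the toy values are assumptions made for the example.

```python
import math

R = 1.987  # gas constant in cal/(mol*K)


def residue_stability(states, n_residues, temperature=298.15):
    """Return (stability constant, residue-specific free energy) per residue.

    states: list of (folded_residues, dG) pairs, where folded_residues is a
    set of residue indices and dG is relative to the fully folded state.
    Every residue must be unfolded in at least one state.
    """
    rt = R * temperature
    weights = [math.exp(-dg / rt) for _, dg in states]
    results = []
    for j in range(n_residues):
        q_f = sum(w for (folded, _), w in zip(states, weights) if j in folded)
        q_nf = sum(w for (folded, _), w in zip(states, weights) if j not in folded)
        kappa = q_f / q_nf               # ratio of summed probabilities
        dg_j = -rt * math.log(kappa)     # equals RT ln Q_nf - RT ln Q_f
        results.append((kappa, dg_j))
    return results


if __name__ == "__main__":
    toy = [({0, 1}, 0.0), ({0}, 1000.0), (set(), 3000.0)]
    for j, (kappa, dg_j) in enumerate(residue_stability(toy, 2)):
        print(j, round(math.log(kappa), 2), round(dg_j, 1))
```

In the toy ensemble residue 0 stays folded in a low-energy partially unfolded state, so its stability constant is higher than that of residue 1 even though neither value says anything about the intrinsic contribution of the amino acid at that position.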



[0112] As shown in Figure 8, the stability constants provided a residue-specific
description of the regional differences in stability within a protein structure. The importance
of this quantity from the point of view of fold recognition is two-fold. First, the stability
constant can be compared directly to protection factors obtained from native state hydrogen
exchange experiments, thus providing an experimentally verifiable residue-specific
description of the ensemble. Second, as amino acids are non-randomly distributed across
high, medium and low stability environments, the stability constant as a function of residue
position provides a convenient one-dimensional representation of the three-dimensional structure.

Example 12 

Identification of Additional Thermodynamic Determinants

[0113] First, the $\Delta G_i$ for each microstate i in the ensemble was composed of
solvation and conformational entropy terms as described by Equation 9 and Example 10.
Equation 9 was rewritten in terms of the enthalpic and entropic components:

$$\Delta G_i = \Delta H_{i,\mathrm{solvation}} - T\left(\Delta S_{i,\mathrm{solvation}} + \Delta S_{i,\mathrm{conformational}}\right) \qquad (12)$$

[0114] Each of the solvation terms in Equation 12 was further expanded into
contributions based on apolar and polar surface area:

$$\Delta G_i = \left(\Delta H_{i,\mathrm{solvation,apolar}} + \Delta H_{i,\mathrm{solvation,polar}}\right) - T\left(\Delta S_{i,\mathrm{solvation,apolar}} + \Delta S_{i,\mathrm{solvation,polar}}\right) - T\left(\Delta S_{i,\mathrm{conformational}}\right) \qquad (13)$$

[0115] However, the identical values for the apolar and polar areas of each state
were used for the respective terms in the enthalpy and entropy calculations. Therefore, the
absolute values for the enthalpy and entropy terms for a given area type were related by
constants $k_1$ (for apolar area) and $k_2$ (for polar area), yielding the expression:

$$\Delta G_i = \left(\Delta H_{i,\mathrm{solvation,apolar}} + \Delta H_{i,\mathrm{solvation,polar}}\right) - T\left(k_1\,\Delta H_{i,\mathrm{solvation,apolar}} + k_2\,\Delta H_{i,\mathrm{solvation,polar}}\right) - T\left(\Delta S_{i,\mathrm{conformational}}\right) \qquad (14)$$

[0116] Grouping area types together and simplifying gives:

$$\Delta G_i = \left[\Delta H_{i,\mathrm{solvation,apolar}}\,(1 - T k_1)\right] + \left[\Delta H_{i,\mathrm{solvation,polar}}\,(1 - T k_2)\right] - T\left(\Delta S_{i,\mathrm{conformational}}\right) \qquad (15)$$

[0117] Equation 15 revealed that for a given free energy and conformational 
entropy, the relative contribution of polar and apolar surface to the solvation free energy was 
ascertained from the ratio of polar to apolar enthalpy for each state. 

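[0118] Thus, to arrive at a residue-specific contribution of polar and apolar
solvation, a given thermodynamic parameter (i.e., enthalpy or entropy) is considered an
average excess quantity, which represents the population-weighted contribution of all states
in the ensemble. For instance, the average excess enthalpy and entropy were defined as:

$$\langle \Delta H \rangle = \sum_{i=1}^{N_{\mathrm{states}}} P_i\,\Delta H_i = \sum_{i=1}^{N_{\mathrm{states}}} \frac{e^{-\Delta G_i/RT}}{Q}\,\Delta H_i \qquad (16A)$$

$$\langle \Delta S \rangle = \sum_{i=1}^{N_{\mathrm{states}}} P_i\,\Delta S_i = \sum_{i=1}^{N_{\mathrm{states}}} \frac{e^{-\Delta G_i/RT}}{Q}\,\Delta S_i \qquad (16B)$$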

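[0119] Following from Equations 16A and 16B, residue-specific descriptors of
the polar and apolar enthalpy were defined accordingly. The polar component of the
enthalpy was defined as the difference between the average excess polar enthalpy from the
sub-ensemble in which residue j is folded ($\langle \Delta H_{\mathrm{pol},f,j} \rangle$) and the average excess polar
enthalpy from the sub-ensemble in which residue j is unfolded ($\langle \Delta H_{\mathrm{pol},nf,j} \rangle$):

$$\Delta H_{\mathrm{pol},j} = \langle \Delta H_{\mathrm{pol},f,j} \rangle - \langle \Delta H_{\mathrm{pol},nf,j} \rangle \qquad (17)$$

where:

$$\langle \Delta H_{\mathrm{pol},f,j} \rangle = \sum_{i=1}^{N_{j\,\mathrm{folded}}} \frac{\Delta H_{\mathrm{pol},i}\,e^{-\Delta G_i/RT}}{Q_{f,j}} \qquad (18)$$

$$\langle \Delta H_{\mathrm{pol},nf,j} \rangle = \sum_{i=1}^{N_{j\,\mathrm{not\,folded}}} \frac{\Delta H_{\mathrm{pol},i}\,e^{-\Delta G_i/RT}}{Q_{nf,j}} \qquad (19)$$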

[0120] It is important to note that the summations in Equations 18 and 19 were
only over the sub-ensembles in which residue j was folded and unfolded, respectively, and
the parameters $Q_{f,j}$ and $Q_{nf,j}$ were the sub-partition functions for those sub-ensembles. By
identical reasoning, the residue-specific apolar component to the enthalpy of residue j and the
residue-specific conformational entropy component of residue j were defined as:

$$\Delta H_{\mathrm{apol},j} = \langle \Delta H_{\mathrm{apol},f,j} \rangle - \langle \Delta H_{\mathrm{apol},nf,j} \rangle \qquad (20)$$

$$\Delta S_{\mathrm{conf},j} = \langle \Delta S_{\mathrm{conf},f,j} \rangle - \langle \Delta S_{\mathrm{conf},nf,j} \rangle \qquad (21)$$

[0121] As in the case with the residue stability constant, the expressions for the
residue-specific $\Delta H_{\mathrm{apol},j}$, $\Delta H_{\mathrm{pol},j}$ and $\Delta S_{\mathrm{conf},j}$ do not provide the contributions of residue j to the
respective overall thermodynamic properties. Instead, Equations 17, 20 and 21 reflect the
average thermodynamic environments of that residue, accounting implicitly for the
contribution of all the amino acids over all the states in the ensemble.
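The residue-specific descriptors of Equations 17, 20 and 21 share one computational pattern: a Boltzmann-weighted average over the sub-ensemble in which residue j is folded, minus the same average over the sub-ensemble in which it is unfolded. A minimal Python sketch of that pattern follows; the function name, the data layout and the toy values are assumptions made for illustration.

```python
import math

R = 1.987  # gas constant in cal/(mol*K)


def residue_descriptor(states, j, temperature=298.15):
    """Folded-minus-unfolded sub-ensemble average of a state property.

    states: list of (folded_residues, dG, value) triples, where value is the
    property being averaged for that state (for example its polar enthalpy).
    Residue j must be folded in some states and unfolded in others.
    """
    rt = R * temperature
    sum_f = q_f = sum_nf = q_nf = 0.0
    for folded, dg, value in states:
        w = math.exp(-dg / rt)  # statistical weight of the state
        if j in folded:
            sum_f += w * value
            q_f += w            # sub-partition function, residue j folded
        else:
            sum_nf += w * value
            q_nf += w           # sub-partition function, residue j unfolded
    return sum_f / q_f - sum_nf / q_nf


if __name__ == "__main__":
    toy = [({0, 1}, 0.0, 0.0), ({0}, 500.0, 2000.0),
           ({1}, 800.0, 3000.0), (set(), 1500.0, 6000.0)]
    print(round(residue_descriptor(toy, 0), 1))
```

Passing the apolar enthalpy or the conformational entropy of each state as `value` gives the other two descriptors with the same code.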

Example 13 

Residue-Specific Thermodynamic Environments

[0122] Using Equations 2, 17, 20, and 21, thermodynamic environments were
empirically defined so as to systematically account for the different contributions of solvation
and conformational entropy to the overall stability constant of each residue. As shown in
Figure 9A-Figure 9C, three thermodynamic dimensions were considered: stability ($\kappa_{f,j}$),
enthalpy ($H_{\mathrm{ratio},j}$), and entropy ($S_{\mathrm{ratio},j}$). The first dimension utilizes the stability constant
classification (Figure 8A and Figure 8B) defined by Equation 2. As the particular value for
the stability constant can arise from conformational entropy or solvent related phenomena, a
second dimension was utilized that provided the ratio of the conformational entropy to the
total solvation free energy:

$$S_{\mathrm{ratio},j} = \frac{\Delta S_{\mathrm{conf},j}}{\Delta G_{\mathrm{solv},j}} \qquad (22)$$

[0123] where $\Delta G_{\mathrm{solv},j}$ is the total residue-specific solvation component
calculated similar to Equations 17-21. Finally, as the total solvation component can arise
from polar or apolar contributions, a third dimension was incorporated that provided the ratio
of polar to apolar enthalpy described by Equations 17 and 20:

$$H_{\mathrm{ratio},j} = \frac{\Delta H_{\mathrm{pol},j}}{\Delta H_{\mathrm{apol},j}} \qquad (23)$$

[0124] Thus, the residues making up the 81 proteins (Table 5) that were 
analyzed partitioned non-randomly within the three-dimensional thermodynamic space. The 
non-random distribution of residues resulted in an empirical partitioning of the residue- 
specific data into twelve thermodynamic categories by dividing the stability data into three 
categories, the enthalpy data into two categories, and the entropy data into two categories 
(Figure 9A-Figure 9C).

Example 14 
Binning of Thermodynamic Environments 

[0125] Each of the 5849 residues in the database was binned into one of the
twelve thermodynamic environment classes based on its stability ($\kappa_{f,j}$), enthalpy ($H_{\mathrm{ratio},j}$),
and entropy ($S_{\mathrm{ratio},j}$) values. These thermodynamic environments were denoted by the
following abbreviations: LLL, LLH, LHL, LHH, MLL, MLH, MHL, MHH, HLL, HLH,
HHL, HHH. For example, residues in the MLH thermodynamic environment were binned
into the Medium (M) stability ($\kappa_{f,j}$) class, the Low (L) enthalpy ($H_{\mathrm{ratio},j}$) class, and the High
(H) entropy ($S_{\mathrm{ratio},j}$) class. The cutoffs for each thermodynamic class were defined as:



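Stability ($\kappa_{f,j}$) class (L, M, or H):

Low $\kappa_{f,j}$ (L) = [ $\ln \kappa_{f,j} < 7.95$ ] (22)

Medium $\kappa_{f,j}$ (M) = [ $7.95 \le \ln \kappa_{f,j} < 13.4$ ] (23)

High $\kappa_{f,j}$ (H) = [ $13.4 \le \ln \kappa_{f,j}$ ] (24)

Enthalpy ($H_{\mathrm{ratio},j}$) class (L or H):

Low $H_{\mathrm{ratio},j}$ (L) = [ $-\Delta H_{\mathrm{pol}} < -1.024\,\Delta H_{\mathrm{apol}} - 2553$ ] (25)

High $H_{\mathrm{ratio},j}$ (H) = [ $-\Delta H_{\mathrm{pol}} \ge -1.024\,\Delta H_{\mathrm{apol}} - 2553$ ] (26)

Entropy ($S_{\mathrm{ratio},j}$) class (L or H):

Low $S_{\mathrm{ratio},j}$ (L) = [ $-T\Delta S_{\mathrm{conf}} < 0.125\,\Delta G_{\mathrm{solv}} - 3053$ ] (27)

High $S_{\mathrm{ratio},j}$ (H) = [ $-T\Delta S_{\mathrm{conf}} \ge 0.125\,\Delta G_{\mathrm{solv}} - 3053$ ] (28)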
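The cutoffs above translate into a simple three-letter classifier. The Python sketch below is illustrative: the function and argument names are hypothetical, the inequalities and their sign conventions are copied from the cutoffs as printed, and the units of the energy terms are assumed to be those of the rest of the calculation.

```python
def classify_environment(ln_kappa, dh_pol, dh_apol, t_ds_conf, dg_solv):
    """Return the three-letter thermodynamic environment of one residue.

    ln_kappa:   natural log of the residue stability constant
    dh_pol:     residue-specific polar enthalpy
    dh_apol:    residue-specific apolar enthalpy
    t_ds_conf:  temperature times residue-specific conformational entropy
    dg_solv:    residue-specific total solvation free energy
    """
    if ln_kappa < 7.95:
        stability = "L"
    elif ln_kappa < 13.4:
        stability = "M"
    else:
        stability = "H"
    # Enthalpy class: line separating low from high polar/apolar ratio.
    enthalpy = "L" if -dh_pol < -1.024 * dh_apol - 2553 else "H"
    # Entropy class: line separating low from high conformational entropy.
    entropy = "L" if -t_ds_conf < 0.125 * dg_solv - 3053 else "H"
    return stability + enthalpy + entropy


if __name__ == "__main__":
    print(classify_environment(10.0, 5000.0, 0.0, 4000.0, 0.0))
```

Three stability classes times two enthalpy classes times two entropy classes give the twelve environments listed in the text.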

[0126] Visual inspection of the segregation of amino acid types as a function of
various thermodynamic parameters extracted from the 81-protein COREX database, guided
by the development outlined above, suggested that the general classifications of stability,
enthalpy, and entropy reasonably divided thermodynamic space (as indicated in Figure
9). The exact cutoffs for the twelve residue-specific thermodynamic environments used in
the threading calculations were determined automatically by an exhaustive grid search of all
possible cutoffs. The utility of each trial set of cutoffs was initially determined from a coarse search
of cutoff space by threading a constant subset of 8 targets in the protein database and
recording sets of cutoffs that maximized the Z-scores and percentiles for each target. Then, a
finer grid search over the best sets of cutoffs, threading against a subset of 20 targets for each
trial set of cutoffs, resulted in the optimized set of cutoffs used for the threading experiments
shown in this work. Identical cutoffs were used for the alpha/beta threading calculations, i.e.,
no special optimization was performed for the scoring of the alpha/beta experiment.

[0127] Statistics for amino acid type as a function of each of the 
thermodynamic environments were tabulated (Table 6) and the log-odds probability for an 
amino acid type to be in each thermodynamic environment was calculated. The resulting 
histograms (Figure 10) revealed a non-random distribution of the amino acids within the 
thermodynamic environments. For example, hydrophobic residues such as Ile, Phe, and Val
were observed with lower frequency in the MLL environment, while polar and charged
amino acids such as Asp, Gln, and Lys were observed with higher frequency in this
environment. These distributions cannot always be rationalized on the basis of side chain 
chemical properties, however, as the basic amino acids Arg and Lys exhibited very different 
propensities to occur in the MHL environment. This latter observation must be a reflection of 
the fact that ensemble-derived energetics included averaged tertiary enthalpic and entropic 
information that is not encoded by individual side chain properties alone. 
Example 15
Fold-Recognition Details 

[0128] Simple fold-recognition experiments were performed based on amino 
acid distributions within the twelve thermodynamic environments. 

[0129] Briefly, a profiling method was used to create thermodynamic
environment profiles for each of the 81 proteins in the database (Bowie et al., 1991; Gribskov
et al., 1987). The 81 amino acid sequences (Table 5) coding for the native structures used in
the database (in addition to 3777 decoy sequences) were each threaded against the 81 target
thermodynamic environment profiles. The decoy sequences were obtained from the Protein
Data Bank and were inclusive for all sequences coding for "foldable" proteins ranging from
35 to 100 residues.

[0130] Next, a 3D-1D scoring matrix for each protein in the database was 
calculated, in which the scoring matrix data was simply the log-odds probabilities of finding 
amino acid types in one of the thermodynamic environment classes (Equation 30, below). 
The resulting profile of the target protein was then optimally aligned to each member of a 
library of amino acid sequences (i.e. 3858 decoy sequences) by maximizing the score 
between the sequence and the profile using a local alignment algorithm based on the Smith- 
Waterman algorithm (Smith & Waterman, 1981) as implemented in PROFILESEARCH 
(Bowie et al, 1991). No attempt was made to optimize the gap opening and extension 
penalties for the local algorithm; in all cases these were the default values given in the 
PROFILESEARCH package, 5.00 and 0.05, respectively. Z-scores were computed from
PROFILESEARCH for each threading result from Equation (30):

$$Z = \frac{s - \langle S \rangle}{\sigma} \qquad (30)$$

[0131] In Equation 30, s was the PROFILESEARCH threading score of a
sequence i when threaded against the structure corresponding to sequence i, $\langle S \rangle$ was the
average threading score of all sequences in the database (identical in length to sequence i)
threaded against the structure corresponding to sequence i, and $\sigma$ was the standard deviation
of the scores of all sequences in the database (identical in length to sequence i) threaded
against the structure corresponding to sequence i. Thus, the Z-score was the number of
standard deviations above the mean that sequence i scored against its target.
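The Z-score is the number of standard deviations by which a sequence scores above the mean of the comparison sequences. A minimal Python sketch follows; the function name is hypothetical, and the use of the population standard deviation is an assumption because the text does not say which estimator PROFILESEARCH uses.

```python
from statistics import mean, pstdev


def z_score(native_score, comparison_scores):
    """Standard deviations by which native_score exceeds the mean score.

    comparison_scores: threading scores of the same-length database
    sequences threaded against the same target profile.
    """
    return (native_score - mean(comparison_scores)) / pstdev(comparison_scores)


if __name__ == "__main__":
    print(z_score(5.0, [1.0, 3.0]))
```

A large positive value means the native sequence fits its own thermodynamic environment profile much better than a typical sequence of the same length.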

[0132] Nearly three-fourths (60/81) of the correct sequences scored in the top
5th percentile when threaded against their corresponding thermodynamic environment profile
(Figure 10), and the Z-scores (the number of standard deviations a particular sequence scored
above the mean score of all chains of identical length) for these successful threadings ranged
from 1.76 to 12.23 (Table 7).



Table 7. Fold Recognition Results

No.  PDB          % Rank  Z Score
1    1A1I:A       0.29    3.49
2    1A6S:        0.67    3.23
3    1A80:        0.34    3.29
4    1AA3:        3.84    2.08
5    1ABA:        0.03    4.10
6    1ADR:        0.93    3.71
7    1AIW: (?)    2.36    2.27
8    1AN4:A       23.64   0.68
9    1AOI:B       26.31   0.52
10   1AVY:C       5.16    1.82
11   1B?G:A (?)   0.18    4.48
12   1BDD: (?)    0.44    5.07
13   1BDO: (?)    0.05    6.25
14   1BF4:A       0.16    4.04
15   1BG8:A       33.23   0.32
16   1BO9:A       0.21    4.06
17   1C1Y:B       95.44   -1.46
18   1CC5:        0.13    5.30
19   1CHC:        67.88   -0.55
20   1CTF:        32.17   0.22
21   1CYO:        5.47    1.76
22   1D3B:B       0.93    2.70
23   1DOQ:A       0.03    4.34
24   1DT4:A       0.08    6.83
25   1EGW:A       4.33    2.14
26   1EO0:A       0.88    4.01
27   1FGP:        2.13    2.65
28   1GDC:        64.41   -0.45
29   1HCR:A (?)   0.16    4.70
30   1HDJ:        1.35    2.80
31   1HOE:        0.13    5.62
36   1KDX:A       0.03    9.34
37   1KJS:        32.40   0.26
41   1MJC:        4.07    1.99
42   1MKN:A       3.24    2.33
43   1MOF:        65.34   -0.47
44   1MWP:A       24.29   0.56
45   1NHM:_       17.26   0.93
46   1NKL:        0.91    3.19
47   1NPS:A       0.13    4.36
48   1NRE:        24.29   0.54
49   1NTC:A       39.71   0.10
50   1NXB:        0.78    4.10
51   1OPD:        4.15    2.09
52   1OTF:A       1.09    3.49
53   1PCF:A       40.95   0.17
54   1PGB:        0.13    5.90
55   1PLC:        0.13    8.42
56   1PTF:        7.34    1.63
57   1PTQ:        9.62    1.33
58   1PTX:        0.47    4.21
59   1QA4:A       45.59   -0.05
60   1QGW:B       2.95    2.25
61   1QQV:A       1.87    2.73
62   1R1B:A       22.76   0.68
63   1ROP:        42.48   0.02
64   1RZL:        0.05    6.57
65   1SHG:        0.08    6.09
66   1SKN:P       0.03    6.28
67   1SVF:B       20.14   0.67
68   1TBA:A       1.09    2.68
69   1TGS:I       2.62    2.60
70   1TRL:A       23.54   0.53
71   1UGI:D       0.44    7.02
76   2CI2:I       5.44    2.06
77   2KNT:        0.08    12.23
78   2SPG:A       0.39    5.31
79   3EIP:A       0.18    5.53
80   3NCM:A       0.44    4.24
81   5HPG:A       0.05    11.02

Entries 32-35, 38-40 and 72-75 are not present in this copy; one unattributed pair of values
(% Rank 2.41, Z Score 2.50) survives among them. A PDB code followed by "(?)" is the best
reading of a damaged entry.



Example 16 
Construction of Scoring Matrices 

[0133] The scoring matrices were calculated as log-odds probabilities of finding
residue type j in structural environment k as described below (Wrabl et al., 2001; Bowie et
al., 1991). The matrix score, $S_{j,k}$, was defined as:

$$S_{j,k} = \ln\left(\frac{P_{j|k}}{P_k}\right) \qquad (27)$$

[0134] $P_{j|k}$ is the probability of finding a residue of type j in stability class k (i.e.,
number of counts of residue type j in stability class k divided by the total number of counts of
residue type j), and $P_k$ is the probability of finding any residue in the database in stability
environment k (i.e., number of residues in stability class k, regardless of amino acid type,
divided by the total number of residues in the entire database, regardless of amino acid type).
The structural environment used was one of the twelve COREX thermodynamic
environments (LHH, LHL, LLH, LLL, MHH, MHL, MLH, MLL, HHH, HHL, HLH, HLL),
as described above. The fold recognition target was removed from the database, and the
remaining 80 proteins were used to calculate the probabilities. Therefore, information about
the target was never included in the scoring matrix.
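The leave-one-out construction of the scoring matrix can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name and the data layout (a dictionary mapping a protein identifier to a list of residue type and environment pairs) are hypothetical, the natural logarithm is assumed, and pairs that never occur are simply left out of the result.

```python
import math
from collections import Counter


def scoring_matrix(database, target_id):
    """Log-odds scores for (residue type, environment) pairs.

    database: dict mapping protein id -> list of (residue_type, environment).
    The target protein is excluded so that it never informs its own matrix.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    env_counts = Counter()
    total = 0
    for protein_id, residues in database.items():
        if protein_id == target_id:
            continue  # remove the fold recognition target
        for residue_type, env in residues:
            pair_counts[(residue_type, env)] += 1
            residue_counts[residue_type] += 1
            env_counts[env] += 1
            total += 1
    matrix = {}
    for (residue_type, env), n in pair_counts.items():
        p_pair = n / residue_counts[residue_type]  # counts of j in k / counts of j
        p_env = env_counts[env] / total            # residues in k / all residues
        matrix[(residue_type, env)] = math.log(p_pair / p_env)
    return matrix


if __name__ == "__main__":
    toy = {
        "a": [("A", "HHH"), ("A", "LLL"), ("G", "LLL"), ("G", "LLL")],
        "t": [("A", "HHH")] * 10,
    }
    print(scoring_matrix(toy, "t"))
```

A positive score means the residue type is found in that environment more often than expected from the overall frequency of the environment.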
Example 17

Thermodynamic Information is more Fundamental 
than Secondary Structure Information 

[0135] Secondary structure, although useful in the analysis and classification of
protein folds, is an easily reportable observable that does little to explain the underlying 
physical chemistry of protein structure. In fact, secondary structure can be viewed as a 
manifestation of the backbone/side-chain van der Waals' repulsions that divide phi/psi space, 
modified by the thermodynamic stability afforded by local and tertiary interactions such as 
hydrogen bonding and the hydrophobic effect (Srinivasan & Rose, 1999; Baldwin & Rose, 
1999). Any reasonable description of the energetics of protein structure must be able to 
reflect these realities independent of secondary structural propensities of amino acids and the 
secondary structural classifications of folds. 

[0136] Although the COREX energy function accounts for specific 
interactions only in an implicit way, the results of a COREX calculation may provide deeper 
insight than secondary structure into the structural determinants of protein folds. For
example, Figure 9C compared the thermodynamic environment profiles for an all-alpha
protein and an all-beta protein threaded over their native folds. Visual inspection of the two 
color-coded structures revealed that different thermodynamic environments span single types 
of secondary structure, and that the same thermodynamic environment was found in different 
types of secondary structural elements. 

[0137] Thus, a threading procedure was repeated on a subset of proteins from 
the original database (Table 5), sorted by secondary structure to determine the possibility that 
the thermodynamic environments calculated by COREX represented a fundamental property 
of proteins that transcended structural classifications. 

[0138] First, a scoring table was assembled from the 31 proteins in Table 5 that 
were classified by the SCOP database as being "All alpha" proteins. Second, the 12 "All 
beta" proteins from Table 5 were threaded using the scoring table derived solely from the 
"All alpha" proteins. In other words, amino acid propensities for the thermodynamic 
environments from all-alpha proteins were used to perform fold recognition experiments on 
all-beta proteins. For more than 80% of the targets (10/12), sequences known to adopt the 
native all-beta structures scored in the top 5% of the 3858 decoy sequences (Figure 12).

[0139] This result was a clear demonstration that the energetic information
derived from the COREX calculations was independent of protein secondary structure.

[0141] Although the present invention and its advantages have been described
in detail, it should be understood that various changes, substitutions and alterations can be 
made herein without departing from the spirit and scope of the invention as defined by the 
appended claims. Moreover, the scope of the present application is not intended to be limited 
to the particular embodiments of the process, machine, manufacture, composition of matter, 
means, methods and steps described in the specification. As one of ordinary skill in the art 
will readily appreciate from the disclosure of the present invention, processes, machines, 
manufacture, compositions of matter, means, methods, or steps, presently existing or later to 
be developed that perform substantially the same function or achieve substantially the same 
result as the corresponding embodiments described herein may be utilized according to the 
present invention. Accordingly, the appended claims are intended to include within their 
scope such processes, machines, manufacture, compositions of matter, means, methods, or 
steps. 